rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 05:30:38 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	12850dd5e9	proxy: remove dead code (#8847 ) By marking everything possible as pub(crate), we find a few dead code candidates.	2024-08-27 12:00:35 +01:00
Folke Behrens	d6eede515a	proxy: clippy lints: handle some low hanging fruit (#8829 ) Should be mostly uncontroversial ones.	2024-08-26 15:16:54 +02:00
Conrad Ludgate	701cb61b57	proxy: local auth backend (#8806 ) Adds a Local authentication backend. Updates http to extract JWT bearer tokens and passes them to the local backend to validate.	2024-08-23 18:48:06 +00:00
Folke Behrens	f246aa3ca7	proxy: Fix some warnings by extended clippy checks (#8748 ) * Missing blank lifetimes which is now deprecated. * Matching off unqualified enum variants that could act like variable. * Missing semicolons.	2024-08-19 10:33:46 +02:00
Conrad Ludgate	ad0988f278	proxy: random changes (#8602) ## Problem 1. Hard to correlate startup parameters with the endpoint that provided them. 2. Some configurations are not needed in the `ProxyConfig` struct. ## Summary of changes Because of some borrow checker fun, I needed to switch to an interior-mutability implementation of our `RequestMonitoring` context system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock for such a use-case (needed to be thread safe). Removed the lock of each startup message, instead just logging only the startup params in a successful handshake. Also removed some values from `ProxyConfig` and kept them as arguments. (needed for local-proxy config)	2024-08-07 14:37:03 +01:00
Conrad Ludgate	6ca41d3438	proxy: switch to leaky bucket (#8470 ) ## Problem The current bucket based rate limiter is not very intuitive and has some bad failure cases. ## Summary of changes Switches from fixed interval buckets to leaky bucket impl. A single bucket per endpoint, drains over time. Drains by checking the time since the last check, and draining tokens en-masse. Garbage collection works similar to before, it drains a shard (1/64th of the set) every 2048 checks, and it only removes buckets that are empty. To be compatible with the existing config, I've faffed to make it take the min and the max rps of each as the sustained rps and the max bucket size which should be roughly equivalent.	2024-07-24 12:28:37 +01:00
Christian Schwarz	3f7aebb01c	refactor: postgres_backend: replace abstract shutdown_watcher with CancellationToken (#8295 ) Preliminary refactoring while working on https://github.com/neondatabase/neon/issues/7427 and specifically https://github.com/neondatabase/neon/pull/8286	2024-07-09 21:11:11 +03:00
Conrad Ludgate	e03c3c9893	proxy: cache certain non-retriable console errors for a short time (#8201 ) ## Problem If there's a quota error, it makes sense to cache it for a short window of time. Many clients do not handle database connection errors gracefully, so just spam retry 🤡 ## Summary of changes Updates the node_info cache to support storing console errors. Store console errors if they cannot be retried (using our own heuristic. should only trigger for quota exceeded errors).	2024-07-04 09:03:03 +01:00
Conrad Ludgate	6c5d3b5263	proxy fix wake compute console retry (#8141 ) ## Problem 1. Proxy is retrying errors from cplane that shouldn't be retried 2. ~~Proxy is not using the retry_after_ms value~~ ## Summary of changes 1. Correct the could_retry impl for ConsoleError. 2. ~~Update could_retry interface to support returning a fixed wait duration.~~	2024-06-25 18:07:54 +00:00
Anna Khanova	fbccd1e676	Proxy process updated errors (#8026 ) ## Problem Respect errors classification from cplane	2024-06-13 14:42:26 +02:00
Anna Khanova	00032c9d9f	[proxy] Fix dynamic rate limiter (#7950 ) ## Problem There was a bug in dynamic rate limiter, which exhausted CPU in proxy and proxy wasn't able to accept any connections. ## Summary of changes 1. `if self.available > 1` -> `if self.available >= 1` 2. remove `timeout_at` to use just timeout 3. remove potential infinite loops which can exhaust CPUs.	2024-06-04 05:07:54 +01:00
Conrad Ludgate	238fa47bee	proxy fix wake compute rate limit (#7902 ) ## Problem We were rate limiting wake_compute in the wrong place ## Summary of changes Move wake_compute rate limit to after the permit is acquired. Also makes a slight refactor on normalize, as it caught my eye	2024-05-30 11:09:27 +01:00
Conrad Ludgate	c8cebecabf	proxy: reintroduce dynamic limiter for compute lock (#7737 ) ## Problem Computes that are healthy can manage many connection attempts at a time. Unhealthy computes cannot. We initially handled this with a fixed concurrency limit, but it seems this inhibits pgbench. ## Summary of changes Support AIMD for connect_to_compute lock to allow varying the concurrency limit based on compute health	2024-05-29 11:17:05 +01:00
Anna Khanova	be1a88e574	Proxy added per ep rate limiter (#7636 ) ## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).	2024-05-10 12:17:00 +02:00
Conrad Ludgate	e3a2631df9	proxy: do not invalidate cache for permit errors (#7652) ## Problem If a permit cannot be acquired to connect to compute, the cache is invalidated. This had the observed effect of sending more traffic to ProxyWakeCompute on cplane. ## Summary of changes Make sure that permit acquire failures are marked as "should not invalidate cache".	2024-05-08 10:33:41 +00:00
Tristan Partin	69337be5c2	Fix grammar in provider.rs error message s/temporary/temporarily --------- Co-authored-by: Barry Grenon <barry_grenon@yahoo.ca>	2024-05-06 09:14:42 -05:00
Conrad Ludgate	9b65946566	proxy: add connect compute concurrency lock (#7607 ) ## Problem Too many connect_compute attempts can overwhelm postgres, getting the connections stuck. ## Summary of changes Limit number of connection attempts that can happen at a given time.	2024-05-03 15:45:24 +00:00
Conrad Ludgate	a54ea8fb1c	proxy: move endpoint rate limiter (#7413 ) ## Problem ## Summary of changes Rate limit for wake_compute calls	2024-04-18 06:00:33 +01:00
Anna Khanova	40f15c3123	Read cplane events from regional redis (#7352 ) ## Problem Actually read redis events. ## Summary of changes This is revert of https://github.com/neondatabase/neon/pull/7350 + fixes. * Fixed events parsing * Added timeout after connection failure * Separated regional and global redis clients.	2024-04-11 18:24:34 +00:00
Conrad Ludgate	5299f917d6	proxy: replace prometheus with measured (#6717 ) ## Problem My benchmarks show that prometheus is not very good. https://github.com/conradludgate/measured We're already using it in storage_controller and it seems to be working well. ## Summary of changes Replace prometheus with my new measured crate in proxy only. Apologies for the large diff. I tried to keep it as minimal as I could. The label types add a bit of boiler plate (but reduce the chance we mistype the labels), and some of our custom metrics like CounterPair and HLL needed to be rewritten.	2024-04-11 16:26:01 +00:00
Anna Khanova	0bb04ebe19	Revert "Proxy read ids from redis (#7205)" (#7350) This reverts commit `dbac2d2c47`. ## Problem Proxy pods fail to install in k8s clusters, cplane release blocking. ## Summary of changes Revert	2024-04-10 10:12:55 +00:00
Anna Khanova	5efe95a008	proxy: fix credentials cache lookup (#7349 ) ## Problem Incorrect processing of `-pooler` connections. ## Summary of changes Fix TODO: add e2e tests for caching	2024-04-10 08:30:09 +00:00
Anna Khanova	dbac2d2c47	Proxy read ids from redis (#7205 ) ## Problem Proxy doesn't know about existing endpoints. ## Summary of changes * Added caching of all available endpoints. * On the high load, use it before going to cplane. * Report metrics for the outcome. * For rate limiter and credentials caching don't distinguish between `-pooled` and not TODOs: * Make metrics more meaningful * Consider integrating it with the endpoint rate limiter * Test it together with cplane in preview	2024-04-10 02:40:14 +02:00
Conrad Ludgate	55da8eff4f	proxy: report metrics based on cold start info (#7324 ) ## Problem Would be nice to have a bit more info on cold start metrics. ## Summary of changes * Change connect compute latency to include `cold_start_info`. * Update `ColdStartInfo` to include HttpPoolHit and WarmCached. * Several changes to make more use of interned strings	2024-04-05 16:14:50 +01:00
Anna Khanova	6c18109734	proxy: reuse sess_id as request_id for the cplane requests (#7245 ) ## Problem https://github.com/neondatabase/cloud/issues/11599 ## Summary of changes Reuse the same sess_id for requests within the one session. TODO: get rid of `session_id` in query params.	2024-03-26 11:27:48 +00:00
Conrad Ludgate	77f3a30440	proxy: unit tests for auth_quirks (#7199 ) ## Problem I noticed code coverage for auth_quirks was pretty bare ## Summary of changes Adds 3 happy path unit tests for auth_quirks * scram * cleartext (websockets) * cleartext (password hack)	2024-03-22 13:31:10 +00:00
Anna Khanova	46098ea0ea	proxy: add more missing warm logging (#7133 ) ## Problem There is one more missing thing about cached connections for `cold_start_info`. ## Summary of changes Fix and add comments.	2024-03-15 11:13:15 +00:00
Anna Khanova	b0aff04157	proxy: add new dimension to exclude cplane latency (#7011 ) ## Problem Currently cplane communication is a part of the latency monitoring. It doesn't allow to setup the proper alerting based on proxy latency. ## Summary of changes Added dimension to exclude cplane latency.	2024-03-13 13:50:05 +01:00
Anna Khanova	0554bee022	proxy: Report warm cold start if connection is from the local cache (#7104) ## Problem * quotes in serialized string * no status if connection is from local cache ## Summary of changes * remove quotes * report warm if connection is from local cache	2024-03-13 11:45:19 +00:00
Conrad Ludgate	cc5d6c66b3	proxy: categorise new cplane error message (#7057 ) ## Problem `422 Unprocessable Entity: compute time quota of non-primary branches is exceeded` being marked as a control plane error. ## Summary of changes Add the manual checks to make this a user error that should not be retried.	2024-03-11 09:20:09 +01:00
Anna Khanova	15b3665dc4	proxy: fix bug with populating the data (#7023 ) ## Problem Branch/project and coldStart were not populated to data events. ## Summary of changes Populate it. Also added logging for the coldstart info.	2024-03-05 15:32:58 +00:00
Anna Khanova	3114be034a	proxy: change is cold start to enum (#6948 ) ## Problem Actually it's good idea to distinguish between cases when it's a cold start, but we took the compute from the pool ## Summary of changes Updated to enum.	2024-03-04 10:31:28 +01:00
Arpad Müller	82853cc1d1	Fix warnings and compile errors on nightly (#6886) Nightly has added a bunch of compiler and linter warnings. There are also two dependencies that fail compilation on latest nightly due to using the old `stdsimd` feature name. This PR fixes them.	2024-03-01 17:14:19 +01:00
Anna Khanova	896d51367e	proxy: introduce is cold start for analytics (#6902) ## Problem Data team cannot distinguish between cold start and not cold start. ## Summary of changes Report `is_cold_start` to analytics. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2024-02-27 19:53:02 +04:00
Conrad Ludgate	e0af945f8f	proxy: improve error classification (#6841 ) ## Problem ## Summary of changes 1. Classify further cplane API errors 2. add 'serviceratelimit' and make a few of the timeout errors return that. 3. a few additional minor changes	2024-02-21 10:04:09 +00:00
Anna Khanova	fac50a6264	Proxy refactor auth+connect (#6708 ) ## Problem Not really a problem, just refactoring. ## Summary of changes Separate authenticate from wake compute. Do not call wake compute second time if we managed to connect to postgres or if we got it not from cache.	2024-02-12 18:41:02 +00:00
Conrad Ludgate	96d89cde51	Proxy error reworking (#6453 ) ## Problem Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and doing a bit less radical changes. smaller commits. We currently don't report error classifications in proxy as the current error handling made it hard to do so. ## Summary of changes 1. Add a `ReportableError` trait that all errors will implement. This provides the error classification functionality. 2. Handle Client requests a strongly typed error * this error is a `ReportableError` and is logged appropriately 3. The handle client error only has a few possible error types, to account for the fact that at this point errors should be returned to the user.	2024-02-09 15:50:51 +00:00
Anna Khanova	c63e3e7e84	Proxy: improve http-pool (#6577 ) ## Problem The password check logic for the sql-over-http is a bit non-intuitive. ## Summary of changes 1. Perform scram auth using the same logic as for websocket cleartext password. 2. Split establish connection logic and connection pool. 3. Parallelize param parsing logic with authentication + wake compute. 4. Limit the total number of clients	2024-02-08 12:57:05 +01:00
Conrad Ludgate	6506fd14c4	proxy: more refactors (#6526 ) ## Problem not really any problem, just some drive-by changes ## Summary of changes 1. move wake compute 2. move json processing 3. move handle_try_wake 4. move test backend to api provider 5. reduce wake-compute concerns 6. remove duplicate wake-compute loop	2024-02-02 16:07:35 +00:00
Conrad Ludgate	0856fe6676	proxy: remove per client bytes (#5466 ) ## Problem Follow up to #5461 In my memory usage/fragmentation measurements, these metrics came up as a large source of small allocations. The replacement metric has been in use for a long time now so I think it's good to finally remove this. Per-endpoint data is still tracked elsewhere ## Summary of changes remove the per-client bytes metrics	2024-02-02 12:28:48 +00:00
Anna Khanova	271133d960	Proxy: reduce number of get role secret calls (#6557 ) ## Problem Right now if get_role_secret response wasn't cached (e.g. cache already reached max size) it will send the second (exactly the same request). ## Summary of changes Avoid needless request.	2024-01-31 22:16:56 +00:00
Conrad Ludgate	210700d0d9	proxy: add newtype wrappers for string based IDs (#6445 ) ## Problem too many string based IDs. easy to mix up ID types. ## Summary of changes Add a bunch of `SmolStr` wrappers that provide convenience methods but are type safe	2024-01-24 16:38:10 +00:00
Conrad Ludgate	e03f8abba9	eager parsing of ip addr (#6446 ) ## Problem Parsing the IP address at check time is a little wasteful. ## Summary of changes Parse the IP when we get it from cplane. Adding a `None` variant to still allow malformed patterns	2024-01-23 13:25:01 +00:00
Anna Khanova	1905f0bced	proxy: store role not found in cache (#6439 ) ## Problem There are a lot of responses with 404 role not found error, which are not getting cached in proxy. ## Summary of changes If there was returned an empty secret but with the project_id, store it in cache.	2024-01-23 13:15:05 +01:00
Conrad Ludgate	34ddec67d9	proxy small tweaks (#6398 ) ## Problem In https://github.com/neondatabase/neon/pull/6283 I did a couple changes that weren't directly related to the goal of extracting the state machine, so I'm putting them here ## Summary of changes - move postgres vs console provider into another enum - reduce error cases for link auth - slightly refactor link flow	2024-01-21 09:58:42 +01:00
Anna Khanova	9ace36d93c	Proxy: do not store empty key (#6415 ) ## Problem Currently we store in cache even if the project is undefined. That makes invalidation impossible. ## Summary of changes Do not store if project id is empty.	2024-01-20 16:14:53 +00:00
Conrad Ludgate	551f0cc097	proxy: refactor how neon-options are handled (#6306 ) ## Problem HTTP connection pool was not respecting the PitR options. ## Summary of changes 1. refactor neon_options a bit to allow easier access to cache_key 2. make HTTP not go through `StartupMessageParams` 3. expose SNI processing to replace what was removed in step 2.	2024-01-11 14:58:31 +00:00
Anna Khanova	76372ce002	Added auth info cache with notifications to redis. (#6208) ## Problem Current cache doesn't support any updates from the cplane. ## Summary of changes * Added redis notifier listener. * Added cache which can be invalidated with the notifier. If the notifier is not available, it's just a normal ttl cache. * Updated cplane api. The motivation behind this organization of the data is the following: * In the Neon data model there are projects. Projects could have multiple branches and each branch could have more than one endpoint. * Also there is one special `main` branch. * Password reset works per branch. * Allowed IPs are the same for every branch in the project (except, maybe, the main one). * The main branch can be changed to the other branch. * The endpoint can be moved between branches. Every event described above requires some special processing on the proxy (or cplane) side. The idea of invalidating for the project is that whenever one of the events above is happening with the project, proxy can invalidate all entries for the entire project. This approach also requires some additional API change (returning project_id inside the auth info).	2024-01-10 11:51:05 +00:00
Conrad Ludgate	8a646cb750	proxy: add request context for observability and blocking (#6160 ) ## Summary of changes ### RequestMonitoring We want to add an event stream with information on each request for easier analysis than what we can do with diagnostic logs alone (https://github.com/neondatabase/cloud/issues/8807). This RequestMonitoring will keep a record of the final state of a request. On drop it will be pushed into a queue to be uploaded. Because this context is a bag of data, I don't want this information to impact logic of request handling. I personally think that weakly typed data (such as all these options) makes for spaghetti code. I will however allow for this data to impact rate-limiting and blocking of requests, as this does not _really_ change how a request is handled. ### Parquet Each `RequestMonitoring` is flushed into a channel where it is converted into `RequestData`, which is accumulated into parquet files. Each file will have a certain number of rows per row group, and several row groups will eventually fill up the file, which we then upload to S3. We will also upload smaller files if they take too long to construct.	2024-01-08 11:42:43 +00:00
Conrad Ludgate	1c037209c7	proxy: fix compute addr parsing (#6237 ) ## Problem control plane should be able to return domain names and not just IP addresses. ## Summary of changes 1. add regression tests 2. use rsplit to split the port from the back, then trim the ipv6 brackets	2023-12-29 09:32:24 +00:00
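
The approach summarised in commit 1c037209c7 (split the port off from the back, then trim the IPv6 brackets) can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function name `split_host_port` is hypothetical, and it uses the standard library's `str::rsplit_once` where the commit message mentions `rsplit`.

```rust
/// Hypothetical helper (not taken from the neon source): splits a compute
/// address of the form `host:port` into its host and port parts.
///
/// Splitting at the LAST colon keeps IPv6 literals such as `[::1]:5432`
/// intact, and the host is returned as a plain string so that domain names
/// work as well as IP addresses.
fn split_host_port(addr: &str) -> Option<(&str, u16)> {
    // Take everything after the final ':' as the port.
    let (host, port) = addr.rsplit_once(':')?;
    let port = port.parse::<u16>().ok()?;

    // Remove the surrounding brackets of an IPv6 literal, if both are present.
    let host = host
        .strip_prefix('[')
        .and_then(|h| h.strip_suffix(']'))
        .unwrap_or(host);

    if host.is_empty() {
        return None;
    }
    Some((host, port))
}
```

Under these assumptions an IPv4 address, a bracketed IPv6 literal, and a domain name all yield a host and a port, while input with no colon or with a non-numeric port yields `None`.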

86 Commits