
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-05 04:22:56 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	60e5a56a5a	proxy: include client IP in ip deny message (#6854 ) ## Problem Debugging IP deny errors is difficult for our users ## Summary of changes Include the client IP in the deny message	2024-02-21 18:24:59 +01:00
Conrad Ludgate	e0af945f8f	proxy: improve error classification (#6841 ) ## Problem ## Summary of changes 1. Classify further cplane API errors 2. add 'serviceratelimit' and make a few of the timeout errors return that. 3. a few additional minor changes	2024-02-21 10:04:09 +00:00
Conrad Ludgate	21a86487a2	proxy: fix #6529 (#6807 ) ## Problem `application_name` for HTTP is not being recorded ## Summary of changes get `application_name` query param	2024-02-20 11:58:01 +01:00
Conrad Ludgate	686b3c79c8	http2 alpn (#6815 ) ## Problem Proxy already supported HTTP2, but I expect no one is using it because we don't advertise it in the TLS handshake. ## Summary of changes #6335 without the websocket changes.	2024-02-20 10:44:46 +00:00
Conrad Ludgate	d0d4871682	proxy: use postgres_protocol scram/sasl code (#4748 ) 1) `scram::password` was used in tests only. can be replaced with `postgres_protocol::password`. 2) `postgres_protocol::authentication::sasl` provides a client impl of SASL which improves our ability to test	2024-02-19 12:54:17 +00:00
Joonas Koivunen	80854b98ff	move timeouts and cancellation handling to remote_storage (#6697 ) Cancellation and timeouts are handled at remote_storage callsites, if they are. However they should always be handled, because we've had transient problems with remote storage connections. - Add cancellation token to the `trait RemoteStorage` methods - For `download`, `list` methods there is `DownloadError::{Cancelled,Timeout}` - For the rest now using `anyhow::Error`, it will have root cause `remote_storage::TimeoutOrCancel::{Cancel,Timeout}` - Both types have `::is_permanent` equivalent which should be passed to `backoff::retry` - New generic RemoteStorageConfig option `timeout`, defaults to 120s - Start counting timeouts only after acquiring concurrency limiter permit - Cancellable permit acquiring - Download stream timeout or cancellation is communicated via an `std::io::Error` - Exit backoff::retry by marking cancellation errors permanent Fixes: #6096 Closes: #4781 Co-authored-by: arpad-m <arpad-m@users.noreply.github.com>	2024-02-14 23:24:07 +00:00
Anna Khanova	c7538a2c20	Proxy: remove fail fast logic to connect to compute (#6759 ) ## Problem Flaky tests ## Summary of changes Remove failfast logic	2024-02-14 18:43:52 +00:00
Conrad Ludgate	a9ec4eb4fc	hold cancel session (#6750 ) ## Problem In a recent refactor, we accidentally dropped the cancel session early ## Summary of changes Hold the cancel session during proxy passthrough	2024-02-14 10:26:32 +00:00
Anna Khanova	331935df91	Proxy: send cancel notifications to all instances (#6719 ) ## Problem If cancel request ends up on the wrong proxy instance, it doesn't take an effect. ## Summary of changes Send redis notifications to all proxy pods about the cancel request. Related issue: https://github.com/neondatabase/neon/issues/5839, https://github.com/neondatabase/cloud/issues/10262	2024-02-13 17:58:58 +01:00
Anna Khanova	fac50a6264	Proxy refactor auth+connect (#6708 ) ## Problem Not really a problem, just refactoring. ## Summary of changes Separate authenticate from wake compute. Do not call wake compute second time if we managed to connect to postgres or if we got it not from cache.	2024-02-12 18:41:02 +00:00
Conrad Ludgate	789a71c4ee	proxy: add more http logging (#6726 ) ## Problem hard to see where time is taken during HTTP flow. ## Summary of changes add a lot more for query state. add a conn_id field to the sql-over-http span	2024-02-12 15:03:45 +00:00
Conrad Ludgate	98ec5c5c46	proxy: some more parquet data (#6711 ) ## Summary of changes add auth_method and database to the parquet logs	2024-02-12 13:14:06 +00:00
Anna Khanova	020e607637	Proxy: copy bidirectional fork (#6720 ) ## Problem `tokio::io::copy_bidirectional` doesn't close the connection once one of the sides closes it. It's not really suitable for the postgres protocol. ## Summary of changes Fork `copy_bidirectional` and initiate a shutdown for both connections. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2024-02-12 14:04:46 +01:00
Conrad Ludgate	cbd3a32d4d	proxy: decode username and password (#6700 ) ## Problem usernames and passwords can be URL 'percent' encoded in the connection string URL provided by serverless driver. ## Summary of changes Decode the parameters when getting conn info	2024-02-09 19:22:23 +00:00
Conrad Ludgate	96d89cde51	Proxy error reworking (#6453 ) ## Problem Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and doing a bit less radical changes. smaller commits. We currently don't report error classifications in proxy as the current error handling made it hard to do so. ## Summary of changes 1. Add a `ReportableError` trait that all errors will implement. This provides the error classification functionality. 2. Handle Client requests a strongly typed error * this error is a `ReportableError` and is logged appropriately 3. The handle client error only has a few possible error types, to account for the fact that at this point errors should be returned to the user.	2024-02-09 15:50:51 +00:00
Conrad Ludgate	ea089dc977	proxy: add per query array mode flag (#6678 ) ## Problem Drizzle needs to be able to configure the array_mode flag per query. ## Summary of changes Adds an array_mode flag to the query data json that will otherwise default to the header flag.	2024-02-09 10:29:20 +00:00
Anna Khanova	6c34d4cd14	Proxy: set timeout on establishing connection (#6679 ) ## Problem There is no timeout on the handshake. ## Summary of changes Set the timeout on the establishing connection.	2024-02-08 13:52:04 +00:00
Anna Khanova	c63e3e7e84	Proxy: improve http-pool (#6577 ) ## Problem The password check logic for the sql-over-http is a bit non-intuitive. ## Summary of changes 1. Perform scram auth using the same logic as for websocket cleartext password. 2. Split establish connection logic and connection pool. 3. Parallelize param parsing logic with authentication + wake compute. 4. Limit the total number of clients	2024-02-08 12:57:05 +01:00
Joonas Koivunen	947165788d	refactor: needless cancellation token cloning (#6618 ) The solution we ended up for `backoff::retry` requires always cloning of cancellation tokens even though there is just `.await`. Fix that, and also turn the return type into `Option<Result<T, E>>` avoiding the need for the `E::cancelled()` fn passed in. Cc: #6096	2024-02-06 09:39:06 +02:00
Conrad Ludgate	74c5e3d9b8	use string interner for project cache (#6578 ) ## Problem Running some memory profiling with high concurrent request rate shows seemingly some memory fragmentation. ## Summary of changes Eventually, we will want to separate global memory (caches) from local memory (per connection handshake and per passthrough). Using a string interner for project info cache helps reduce some of the fragmentation of the global cache by having a single heap dedicated to project strings, and not scattering them throughout all a requests. At the same time, the interned key is 4 bytes vs the 24 bytes that `SmolStr` offers. Important: we should only store verified strings in the interner because there's no way to remove them afterwards. Good for caching responses from console.	2024-02-05 14:27:25 +00:00
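The interning idea in commit 74c5e3d9b8 can be illustrated with a minimal, standard-library-only sketch. The `Interner` and `Symbol` names are hypothetical and are not the types used in the proxy; this version also stores each string twice for simplicity, which a production interner backed by a dedicated arena would avoid.

```rust
use std::collections::HashMap;

/// A 4-byte handle that stands in for an interned string (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

/// Append-only interner: strings can be added but never removed,
/// so only verified strings should be stored in it.
#[derive(Default)]
struct Interner {
    lookup: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the existing symbol for `s`, or allocates a new one.
    fn get_or_intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.lookup.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.lookup.insert(s.to_owned(), sym);
        sym
    }

    /// Maps a symbol back to its string, if it came from this interner.
    fn resolve(&self, sym: Symbol) -> Option<&str> {
        self.strings.get(sym.0 as usize).map(String::as_str)
    }
}
```

Interning the same project string twice yields the same `Symbol`, so cache keys compare as plain integers.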
Joonas Koivunen	9dd69194d4	refactor(proxy): std::io::Write for BytesMut exists (#6606) Replace TODO with an existing implementation via `BufMut::writer`.	2024-02-03 22:15:59 +00:00
Conrad Ludgate	6506fd14c4	proxy: more refactors (#6526 ) ## Problem not really any problem, just some drive-by changes ## Summary of changes 1. move wake compute 2. move json processing 3. move handle_try_wake 4. move test backend to api provider 5. reduce wake-compute concerns 6. remove duplicate wake-compute loop	2024-02-02 16:07:35 +00:00
Conrad Ludgate	0856fe6676	proxy: remove per client bytes (#5466 ) ## Problem Follow up to #5461 In my memory usage/fragmentation measurements, these metrics came up as a large source of small allocations. The replacement metric has been in use for a long time now so I think it's good to finally remove this. Per-endpoint data is still tracked elsewhere ## Summary of changes remove the per-client bytes metrics	2024-02-02 12:28:48 +00:00
Anna Khanova	271133d960	Proxy: reduce number of get role secret calls (#6557 ) ## Problem Right now if get_role_secret response wasn't cached (e.g. cache already reached max size) it will send the second (exactly the same request). ## Summary of changes Avoid needless request.	2024-01-31 22:16:56 +00:00
Conrad Ludgate	c7b02ce8ec	proxy: use jemalloc (#6531 ) ## Summary of changes Experiment with jemalloc in proxy	2024-01-31 14:51:11 +01:00
Conrad Ludgate	ec8dcc2231	flatten proxy flow (#6447 ) ## Problem Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and doing a bit less radical changes. smaller commits. Proxy flow was quite deeply nested, which makes adding more interesting error handling quite tricky. ## Summary of changes I recommend reviewing commit by commit. 1. move handshake logic into a separate file 2. move passthrough logic into a separate file 3. no longer accept a closure in CancelMap session logic 4. Remove connect_to_db, copy logic into handle_client 5. flatten auth_and_wake_compute in authenticate 6. record info for link auth	2024-01-29 17:38:03 +00:00
Conrad Ludgate	511e730cc0	hll experiment (#6312 ) ## Problem Measuring cardinality using logs is expensive and slow. ## Summary of changes Implement a pre-aggregated HyperLogLog-based cardinality estimate. HyperLogLog estimates the cardinality of a set by using the probability that the uniform hash of a value will have a run of n 0s at the end is `1/2^n`, therefore, having observed a run of `n` 0s suggests we have measured `2^n` distinct values. By using multiple shards, we can use the harmonic mean to get a more accurate estimate. We record this into a Prometheus time-series. HyperLogLog counts can be merged by taking the `max` of each shard. We can apply a `max_over_time` in order to find the estimate of cardinality of distinct values over time	2024-01-29 07:26:20 +00:00
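The HyperLogLog description in commit 511e730cc0 can be sketched roughly as follows. This is an illustrative, standard-library-only version, not the proxy's implementation: the shard count, the use of `DefaultHasher`, and the bias constant are assumptions, and the usual small-range and large-range corrections are omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of index bits; 2^10 = 1024 shards (an arbitrary choice for this sketch).
const P: u32 = 10;
const M: usize = 1 << P;

#[derive(Clone, Debug, PartialEq)]
struct HyperLogLog {
    /// Each shard stores the longest run of trailing zeros seen, plus one.
    shards: Vec<u8>,
}

impl HyperLogLog {
    fn new() -> Self {
        Self { shards: vec![0; M] }
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        // The low P bits pick the shard; the remaining bits provide the zero run.
        let idx = (hash as usize) & (M - 1);
        let rest = hash >> P;
        let rank = rest.trailing_zeros().min(64 - P) as u8 + 1;
        self.shards[idx] = self.shards[idx].max(rank);
    }

    /// Merging two sketches takes the max of each shard.
    fn merge(&mut self, other: &Self) {
        for (a, b) in self.shards.iter_mut().zip(&other.shards) {
            *a = (*a).max(*b);
        }
    }

    /// Raw estimate: bias-corrected harmonic mean of 2^rank across shards.
    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.shards.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        alpha * m * m / sum
    }
}
```

Because merging is a per-shard `max`, sketches recorded separately (for example per process or per time window) can be combined without re-reading the underlying values.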
Anna Khanova	8253cf1931	proxy: Relax endpoint check (#6503) ## Problem http-over-sql allows host to be in format api.aws.... however it's not the case for the websocket flow. ## Summary of changes Relax endpoint check for the ws serverless connections.	2024-01-28 21:27:14 +00:00
Conrad Ludgate	210700d0d9	proxy: add newtype wrappers for string based IDs (#6445 ) ## Problem too many string based IDs. easy to mix up ID types. ## Summary of changes Add a bunch of `SmolStr` wrappers that provide convenience methods but are type safe	2024-01-24 16:38:10 +00:00
Conrad Ludgate	e03f8abba9	eager parsing of ip addr (#6446 ) ## Problem Parsing the IP address at check time is a little wasteful. ## Summary of changes Parse the IP when we get it from cplane. Adding a `None` variant to still allow malformed patterns	2024-01-23 13:25:01 +00:00
Anna Khanova	1905f0bced	proxy: store role not found in cache (#6439 ) ## Problem There are a lot of responses with 404 role not found error, which are not getting cached in proxy. ## Summary of changes If there was returned an empty secret but with the project_id, store it in cache.	2024-01-23 13:15:05 +01:00
Anna Khanova	3290fb09bf	Proxy: fix gc (#6426 ) ## Problem Gc currently doesn't work properly. ## Summary of changes Change statement on running gc.	2024-01-22 13:24:10 +00:00
Conrad Ludgate	34ddec67d9	proxy small tweaks (#6398 ) ## Problem In https://github.com/neondatabase/neon/pull/6283 I did a couple changes that weren't directly related to the goal of extracting the state machine, so I'm putting them here ## Summary of changes - move postgres vs console provider into another enum - reduce error cases for link auth - slightly refactor link flow	2024-01-21 09:58:42 +01:00
Anna Khanova	9ace36d93c	Proxy: do not store empty key (#6415 ) ## Problem Currently we store in cache even if the project is undefined. That makes invalidation impossible. ## Summary of changes Do not store if project id is empty.	2024-01-20 16:14:53 +00:00
Anna Khanova	f003dd6ad5	Remove rename in parameters (#6411 ) ## Problem Name in notifications is not compatible with console name. ## Summary of changes Rename fields to make it compatible.	2024-01-20 10:20:53 +00:00
Conrad Ludgate	7e7e9f5191	proxy: add more columns to parquet upload (#6405 ) ## Problem Some fields were missed in the initial spec. ## Summary of changes Adds a success boolean (defaults to false unless specifically marked as successful). Adds a duration_us integer that tracks how many microseconds were taken from session start through to request completion.	2024-01-20 09:38:11 +00:00
Anna Khanova	3f2187eb92	Proxy relax sni check (#6323 ) ## Problem Using the same domain name () for serverless driver can help with connection caching. https://github.com/neondatabase/neon/issues/6290 ## Summary of changes Relax SNI check.	2024-01-16 08:42:13 +00:00
Conrad Ludgate	0bac8ddd76	proxy: fix serverless error message info (#6279 ) ## Problem https://github.com/neondatabase/serverless/issues/51#issuecomment-1878677318 ## Summary of changes 1. When we have a db_error, use db_error.message() as the message. 2. include error position. 3. line should be a string (weird?) 4. `datatype` -> `dataType`	2024-01-15 16:43:19 +01:00
Conrad Ludgate	551f0cc097	proxy: refactor how neon-options are handled (#6306 ) ## Problem HTTP connection pool was not respecting the PitR options. ## Summary of changes 1. refactor neon_options a bit to allow easier access to cache_key 2. make HTTP not go through `StartupMessageParams` 3. expose SNI processing to replace what was removed in step 2.	2024-01-11 14:58:31 +00:00
Anna Khanova	a84935d266	Extend unsupported startup parameter error message (#6318 ) ## Problem Unsupported startup parameter error happens with pooled connection. However the reason of this error might not be obvious to the user. ## Summary of changes Send more descriptive message with the link to our troubleshooting page: https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter. Resolves: https://github.com/neondatabase/neon/issues/6291	2024-01-11 12:09:26 +00:00
Anna Khanova	76372ce002	Added auth info cache with notifications to redis. (#6208) ## Problem Current cache doesn't support any updates from the cplane. ## Summary of changes * Added redis notifier listener. * Added cache which can be invalidated with the notifier. If the notifier is not available, it's just a normal ttl cache. * Updated cplane api. The motivation behind this organization of the data is the following: * In the Neon data model there are projects. Projects could have multiple branches and each branch could have more than one endpoint. * Also there is one special `main` branch. * Password reset works per branch. * Allowed IPs are the same for every branch in the project (except, maybe, the main one). * The main branch can be changed to the other branch. * The endpoint can be moved between branches. Every event described above requires some special processing on the proxy (or cplane) side. The idea of invalidating for the project is that whenever one of the events above is happening with the project, proxy can invalidate all entries for the entire project. This approach also requires some additional API change (returning project_id inside the auth info).	2024-01-10 11:51:05 +00:00
Conrad Ludgate	8a646cb750	proxy: add request context for observability and blocking (#6160 ) ## Summary of changes ### RequestMonitoring We want to add an event stream with information on each request for easier analysis than what we can do with diagnostic logs alone (https://github.com/neondatabase/cloud/issues/8807). This RequestMonitoring will keep a record of the final state of a request. On drop it will be pushed into a queue to be uploaded. Because this context is a bag of data, I don't want this information to impact logic of request handling. I personally think that weakly typed data (such as all these options) makes for spaghetti code. I will however allow for this data to impact rate-limiting and blocking of requests, as this does not _really_ change how a request is handled. ### Parquet Each `RequestMonitoring` is flushed into a channel where it is converted into `RequestData`, which is accumulated into parquet files. Each file will have a certain number of rows per row group, and several row groups will eventually fill up the file, which we then upload to S3. We will also upload smaller files if they take too long to construct.	2024-01-08 11:42:43 +00:00
Conrad Ludgate	1c037209c7	proxy: fix compute addr parsing (#6237 ) ## Problem control plane should be able to return domain names and not just IP addresses. ## Summary of changes 1. add regression tests 2. use rsplit to split the port from the back, then trim the ipv6 brackets	2023-12-29 09:32:24 +00:00
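The parsing approach named in commit 1c037209c7 (split the port from the back, then trim the IPv6 brackets) might look something like the sketch below. The function name `split_host_port` is hypothetical and the proxy's actual code may differ; a bare IPv6 address with no port is not handled here.

```rust
/// Splits "host:port" into its parts, accepting IPv4 addresses,
/// bracketed IPv6 addresses, and domain names (hypothetical helper).
fn split_host_port(addr: &str) -> Option<(&str, u16)> {
    // Split at the LAST colon so the colons inside an IPv6 literal are left alone.
    let (host, port) = addr.rsplit_once(':')?;
    let port: u16 = port.parse().ok()?;
    // Trim the brackets that wrap an IPv6 literal such as "[::1]".
    let host = host.trim_start_matches('[').trim_end_matches(']');
    if host.is_empty() {
        return None;
    }
    Some((host, port))
}
```

Splitting from the back is what allows domain names and IPv6 literals to share one code path with IPv4 addresses.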
Conrad Ludgate	2df3602a4b	Add GC to http connection pool (#6196 ) ## Problem HTTP connection pool will grow without being pruned ## Summary of changes Remove connection clients from pools once idle, or once they exit. Periodically clear pool shards. GC Logic: Each shard contains a hashmap of `Arc<EndpointPool>`s. Each connection stores a `Weak<EndpointPool>`. During a GC sweep, we take a random shard write lock, and check that if any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`). - If they are unique, then we check that the endpoint-pool is empty, and sweep if it is. - If they are not unique, then the endpoint-pool is in active use and we don't sweep. - Idle connections will self-clear from the endpoint-pool after 5 minutes. Technically, the uniqueness of the endpoint-pool should be enough to consider it empty, but the connection count check is done for completeness sake.	2023-12-21 12:00:10 +00:00
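The GC logic described in commit 2df3602a4b relies on `Arc::get_mut` returning `Some` only when no other `Arc` or `Weak` pointer refers to the same allocation. A simplified sketch of one shard sweep, with a hypothetical `EndpointPool` layout, a plain `HashMap` in place of the locked shard, and a hypothetical `gc_shard` function:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the per-endpoint pool; the field is an assumption.
struct EndpointPool {
    idle_conns: Vec<u64>,
}

/// Sweeps one shard and returns how many endpoint pools were removed.
fn gc_shard(shard: &mut HashMap<String, Arc<EndpointPool>>) -> usize {
    let before = shard.len();
    shard.retain(|_, pool| match Arc::get_mut(pool) {
        // Unique: no connection holds a Weak reference. Keep only if non-empty.
        Some(pool) => !pool.idle_conns.is_empty(),
        // Not unique: the pool is in active use, so do not sweep it.
        None => true,
    });
    before - shard.len()
}
```

In this sketch a connection is represented by whatever holds a `Weak<EndpointPool>` obtained from `Arc::downgrade`; once every such handle is dropped, the next sweep can remove the empty pool.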
Anna Khanova	6e6e40dd7f	Invalidate credentials on auth failure (#6171 ) ## Problem If the user reset password, cache could receive this information only after `ttl` minutes. ## Summary of changes Invalidate password on auth failure.	2023-12-18 23:24:22 +01:00
Anna Khanova	00d90ce76a	Added cache for get role secret (#6165 ) ## Problem Currently if we are getting many consecutive connections to the same user/ep we will send a lot of traffic to the console. ## Summary of changes Cache with ttl=4min proxy_get_role_secret response. Note: this is the temporary hack, notifier listener is WIP.	2023-12-18 16:04:47 +01:00
Conrad Ludgate	17bde7eda5	proxy refactor large files (#6153 ) ## Problem The `src/proxy.rs` file is far too large ## Summary of changes Creates 3 new files: ``` src/metrics.rs src/proxy/retry.rs src/proxy/connect_compute.rs ```	2023-12-18 10:59:49 +00:00
Conrad Ludgate	98629841e0	improve proxy code cov (#6141 ) ## Summary of changes saw some low-hanging codecov improvements. even if code coverage is somewhat of a pointless game, might as well add tests where we can and delete code if it's unused	2023-12-15 12:11:50 +00:00
Conrad Ludgate	cc633585dc	gauge guards (#6138 ) ## Problem The websockets gauge for active db connections seems to be growing more than the gauge for client connections over websockets, which does not make sense. ## Summary of changes refactor how our counter-pair gauges are represented. not sure if this will improve the problem, but it should be harder to mess-up the counters. The API is much nicer though now and doesn't require scopeguard::defer hacks	2023-12-14 17:21:39 +00:00
Conrad Ludgate	6987b5c44e	proxy: add more rates to endpoint limiter (#6130 ) ## Problem Single rate bucket is limited in usefulness ## Summary of changes Introduce a secondary bucket allowing an average of 200 requests per second over 1 minute, and a tertiary bucket allowing an average of 100 requests per second over 10 minutes. Configured by using a format like ```sh proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 100@10s --endpoint-rps-limit 50@1m ``` If the bucket limits are inconsistent, an error is returned on startup ``` $ proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 10@10s Error: invalid endpoint RPS limits. 10@10s allows fewer requests per bucket than 300@1s (100 vs 300) ```	2023-12-13 21:43:49 +00:00
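The `<rps>@<interval>` format and the startup consistency check from commit 6987b5c44e can be sketched as below. The type and function names are hypothetical, only `s` and `m` suffixes are handled, and the rule (a longer bucket must not allow fewer total requests than a shorter one) is inferred from the error message quoted in the commit.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RateBucket {
    rps: u64,
    interval: Duration,
}

impl RateBucket {
    /// Total requests the bucket admits over its whole interval.
    fn max_requests(&self) -> u64 {
        self.rps * self.interval.as_secs()
    }
}

/// Parses strings such as "300@1s" or "50@1m".
fn parse_bucket(s: &str) -> Result<RateBucket, String> {
    let (rps, interval) = s
        .split_once('@')
        .ok_or_else(|| format!("expected <rps>@<interval>, got {s:?}"))?;
    let rps = rps.parse::<u64>().map_err(|e| format!("invalid rate in {s:?}: {e}"))?;
    let (digits, unit_secs) = if let Some(d) = interval.strip_suffix('s') {
        (d, 1)
    } else if let Some(d) = interval.strip_suffix('m') {
        (d, 60)
    } else {
        return Err(format!("unsupported interval unit in {s:?}"));
    };
    let n = digits.parse::<u64>().map_err(|e| format!("invalid interval in {s:?}: {e}"))?;
    Ok(RateBucket { rps, interval: Duration::from_secs(n * unit_secs) })
}

/// Rejects configurations where a longer bucket admits fewer requests than a shorter one.
fn validate_buckets(buckets: &[RateBucket]) -> Result<(), String> {
    let mut sorted = buckets.to_vec();
    sorted.sort_by_key(|b| b.interval);
    for pair in sorted.windows(2) {
        let (short, long) = (pair[0], pair[1]);
        if long.max_requests() < short.max_requests() {
            return Err(format!(
                "{}@{}s allows fewer requests per bucket than {}@{}s ({} vs {})",
                long.rps, long.interval.as_secs(),
                short.rps, short.interval.as_secs(),
                long.max_requests(), short.max_requests(),
            ));
        }
    }
    Ok(())
}
```

With the values from the commit message, `300@1s`, `100@10s`, `50@1m` admit 300, 1000 and 3000 requests per bucket and pass the check, while `300@1s` with `10@10s` fails because 100 is less than 300.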


336 Commits