rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-09 14:32:57 +00:00

Author	SHA1	Message	Date
George MacKerron	d6fcc18eb2	Add Neon-Batch- headers to OPTIONS response for SQL-over-HTTP requests (#6116) This is needed to allow use of batch queries from browsers. ## Problem SQL-over-HTTP batch queries fail from web browsers because the relevant headers, `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only`, are not included in the server's OPTIONS response. I think we simply forgot to add them when implementing the batch query feature. ## Summary of changes Added `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only` to the OPTIONS response.	2023-12-13 17:18:20 +00:00
Conrad Ludgate	c8316b7a3f	simplify endpoint limiter (#6122 ) ## Problem 1. Using chrono for durations only is wasteful 2. The arc/mutex was not being utilised 3. Locking every shard in the dashmap every GC could cause latency spikes 4. More buckets ## Summary of changes 1. Use `Instant` instead of `NaiveTime`. 2. Remove the `Arc<Mutex<_>>` wrapper, utilising that dashmap entry returns mut access 3. Clear only a random shard, update gc interval accordingly 4. Multiple buckets can be checked before allowing access When I benchmarked the check function, it took on average 811ns when multithreaded over the course of 10 million checks.	2023-12-13 13:53:23 +00:00
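The multi-bucket check described in #6122 can be illustrated with a small sketch. This is not the proxy's implementation: `Bucket`, `EndpointLimiter`, and `check` are hypothetical names, and a plain fixed-window counter is assumed for simplicity. The only ideas taken from the commit message are `Instant`-based timing and checking every bucket before allowing access.

```rust
use std::time::{Duration, Instant};

/// One limit: at most `max` requests per `interval` (hypothetical type).
#[derive(Clone, Copy)]
pub struct Bucket {
    pub max: u32,
    pub interval: Duration,
}

/// Fixed-window state for a single bucket.
struct Window {
    start: Instant,
    count: u32,
}

/// Per-endpoint limiter state (assumed design, one window per bucket).
pub struct EndpointLimiter {
    buckets: Vec<Bucket>,
    windows: Vec<Window>,
}

impl EndpointLimiter {
    pub fn new(buckets: Vec<Bucket>, now: Instant) -> Self {
        let windows = buckets
            .iter()
            .map(|_| Window { start: now, count: 0 })
            .collect();
        Self { buckets, windows }
    }

    /// Returns true if the request is allowed under every bucket.
    pub fn check(&mut self, now: Instant) -> bool {
        // Start a fresh window for each bucket whose interval has elapsed.
        for (b, w) in self.buckets.iter().zip(self.windows.iter_mut()) {
            if now.duration_since(w.start) >= b.interval {
                w.start = now;
                w.count = 0;
            }
        }
        // All buckets must have room before the request is counted.
        let allowed = self
            .buckets
            .iter()
            .zip(self.windows.iter())
            .all(|(b, w)| w.count < b.max);
        if allowed {
            for w in self.windows.iter_mut() {
                w.count += 1;
            }
        }
        allowed
    }
}
```

Passing `now` explicitly keeps the sketch deterministic to test; a real limiter would more likely call `Instant::now()` internally.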
Stas Kelvich	8460654f61	Add per-endpoint rate limiter to proxy	2023-12-13 07:03:21 +02:00
Anna Khanova	9e071e4458	Propagate information about the protocol to console (#6102) ## Problem The snowflake logs currently contain no information about the protocol that the client uses. ## Summary of changes Propagate the protocol together with the app_name, in the format `{app_name}/{sql_over_http/tcp/ws}`. This gives @stepashka more observability into what our clients are using.	2023-12-12 11:42:51 +00:00
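As a rough illustration of the `{app_name}/{sql_over_http/tcp/ws}` format from #6102, the sketch below builds the combined string. The `Protocol` enum and `application_name` function are hypothetical names chosen for this example; only the three protocol labels and the slash-separated layout come from the commit message.

```rust
/// Client protocol labels as listed in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Ws,
    SqlOverHttp,
}

impl Protocol {
    /// Label used in the propagated string.
    pub fn as_str(self) -> &'static str {
        match self {
            Protocol::Tcp => "tcp",
            Protocol::Ws => "ws",
            Protocol::SqlOverHttp => "sql_over_http",
        }
    }
}

/// Joins the client-supplied app name and the protocol as `{app_name}/{protocol}`.
pub fn application_name(app_name: &str, protocol: Protocol) -> String {
    format!("{}/{}", app_name, protocol.as_str())
}
```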
Andrew Rudenko	df1f8e13c4	proxy: pass neon options in deep object format (#6068 ) --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2023-12-08 19:58:36 +01:00
Conrad Ludgate	e1a564ace2	proxy simplify cancellation (#5916) ## Problem The cancellation code was confusing and error prone (as seen before in our memory leaks). ## Summary of changes * Use the new `TaskTracker` primitive instead of JoinSet to gracefully wait for tasks to shut down. * Updated libs/utils/completion to use `TaskTracker` * Remove `tokio::select` in favour of `futures::future::select` in a specialised `run_until_cancelled()` helper function	2023-12-08 16:21:17 +00:00
Conrad Ludgate	699049b8f3	proxy: make auth more type safe (#5689 ) ## Problem `a5292f7e67/proxy/src/auth/backend.rs (L146-L148)` `a5292f7e67/proxy/src/console/provider/neon.rs (L90)` `a5292f7e67/proxy/src/console/provider/neon.rs (L154)` ## Summary of changes 1. Test backend is only enabled on `cfg(test)`. 2. Postgres mock backend + MD5 auth keys are only enabled on `cfg(feature = testing)` 3. Password hack and cleartext flow will have their passwords validated before proceeding. 4. Distinguish between ClientCredentials with endpoint and without, removing many panics in the process	2023-12-08 11:48:37 +00:00
Conrad Ludgate	f9401fdd31	proxy: fix channel binding error messages (#6054 ) ## Problem For channel binding failed messages we were still saying "channel binding not supported" in the errors. ## Summary of changes Fix error messages	2023-12-07 11:47:16 +00:00
Anna Khanova	c48918d329	Rename metric (#6030) ## Problem It looks like the reallocation of the buckets in the previous PR broke the metric in Grafana. ## Summary of changes Renamed the metric.	2023-12-05 10:03:07 +00:00
Anna Khanova	12f02523a4	Enable dynamic rate limiter (#6029 ) ## Problem Limit the number of open connections between the control plane and proxy. ## Summary of changes Enable dynamic rate limiter in prod. Unfortunately the latency metrics are a bit broken, but from logs I see that on staging for the past 7 days only 2 times latency for acquiring was greater than 1ms (for most of the cases it's insignificant).	2023-12-04 15:00:24 +00:00
Conrad Ludgate	f39fca0049	proxy: chore: replace strings with SmolStr (#5786) ## Problem no problem ## Summary of changes replaces boxstr with arcstr as it's cheaper to clone. mild perf improvement. probably should look into other smallstring optimisations tbh, they will likely be even better. The longest endpoint name I was able to construct is something like `ep-weathered-wildflower-12345678` which is 32 bytes. Most string optimisations top out at 23 bytes	2023-11-30 20:52:30 +00:00
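The size argument in #5786 is simple arithmetic, sketched below. `INLINE_CAP` and `fits_inline` are hypothetical names; the 23-byte figure is the one quoted in the commit message, not a value read from any particular crate.

```rust
/// Inline capacity quoted in the commit message ("top out at 23 bytes").
pub const INLINE_CAP: usize = 23;

/// True if `s` would fit in a small-string buffer of `INLINE_CAP` bytes
/// without a heap allocation (under the assumption above).
pub fn fits_inline(s: &str) -> bool {
    s.len() <= INLINE_CAP
}
```

The example endpoint name from the commit message is 32 bytes long, so under this assumption it would not be stored inline.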
Anna Khanova	3657a3c76e	Proxy fix metrics record (#5996) ## Problem Some latency metrics are recorded in an inconsistent way. ## Summary of changes Make sure that everything is recorded in seconds.	2023-11-30 16:33:54 +00:00
Conrad Ludgate	0c87d1866b	proxy: fix wake_compute error prop (#5989) ## Problem fixes #5654 - WakeComputeErrors occurring during a connect_to_compute got propagated as IO errors, which get forwarded to the user as "Couldn't connect to compute node" with no helpful message. ## Summary of changes Handle WakeComputeError during ConnectionError properly	2023-11-30 13:43:21 +00:00
Anna Khanova	e12e2681e9	IP allowlist on the proxy side (#5906 ) ## Problem Per-project IP allowlist: https://github.com/neondatabase/cloud/issues/8116 ## Summary of changes Implemented IP filtering on the proxy side. To retrieve ip allowlist for all scenarios, added `get_auth_info` call to the control plane for: * sql-over-http * password_hack * cleartext_hack Added cache with ttl for sql-over-http path This might slow down a bit, consider using redis in the future. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-11-30 13:14:33 +00:00
Conrad Ludgate	fc77c42c57	proxy: add flag to enable http pool for all users (#5959 ) ## Problem #5123 ## Summary of changes Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set `--sql-over-http-pool-opt-in false` region-by-region	2023-11-30 10:19:30 +00:00
Conrad Ludgate	f05d1b598a	proxy: add more db error info (#5951 ) ## Problem https://github.com/neondatabase/serverless/issues/51 ## Summary of changes include more error fields in the json response	2023-11-30 10:18:59 +00:00
Conrad Ludgate	316309c85b	channel binding (#5683 ) ## Problem channel binding protects scram from sophisticated MITM attacks where the attacker is able to produce 'valid' TLS certificates. ## Summary of changes get the tls-server-end-point channel binding, and verify it is correct for the SCRAM-SHA-256-PLUS authentication flow	2023-11-27 21:45:15 +00:00
Conrad Ludgate	a56fd45f56	proxy: fix memory leak again (#5909 ) ## Problem The connections.join_next helped but it wasn't enough... The way I implemented the improvement before was still faulty but it mostly worked so it looked like it was working correctly. From [`tokio::select` docs](https://docs.rs/tokio/latest/tokio/macro.select.html): > 4. Once an <async expression> returns a value, attempt to apply the value to the provided <pattern>, if the pattern matches, evaluate <handler> and return. If the pattern does not match, disable the current branch and for the remainder of the current call to select!. Continue from step 3. The `connections.join_next()` future would complete and `Some(Err(e))` branch would be evaluated but not match (as the future would complete without panicking, we would hope). Since the branch doesn't match, it's disabled. The select continues but never attempts to call `join_next` again. Getting unlucky, more TCP connections are created than we attempt to join_next. ## Summary of changes Replace the `Some(Err(e))` pattern with `Some(e)`. Because of the auto-disabling feature, we don't need the `if !connections.is_empty()` step as the `None` pattern will disable it for us.	2023-11-23 19:11:24 +00:00
khanova	0c243faf96	Proxy log pid hack (#5869 ) ## Problem Improve observability for the compute node. ## Summary of changes Log pid from the compute node. Doesn't work with pgbouncer.	2023-11-16 20:46:23 +00:00
khanova	6b82f22ada	Collect number of connections by sni type (#5867 ) ## Problem We don't know the number of users with the different kind of authentication: ["sni", "endpoint in options" (A and B from [here](https://neon.tech/docs/connect/connection-errors)), "password_hack"] ## Summary of changes Collect metrics by sni kind.	2023-11-16 12:19:13 +00:00
khanova	2f0d245c2a	Proxy control plane rate limiter (#5785 ) ## Problem Proxy might overload the control plane. ## Summary of changes Implement rate limiter for proxy<->control plane connection. Resolves https://github.com/neondatabase/neon/issues/5707 Used implementation ideas from https://github.com/conradludgate/squeeze/	2023-11-15 09:15:59 +00:00
Conrad Ludgate	7cdde285a5	proxy: limit concurrent wake_compute requests per endpoint (#5799 ) ## Problem A user can perform many database connections at the same instant of time - these will all cache miss and materialise as requests to the control plane. #5705 ## Summary of changes I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints -> semaphores to apply a limiter. If the limiter is enabled (permits > 0), the semaphore will be retrieved per endpoint and a permit will be awaited before continuing to call the wake_compute endpoint. ### Important details This dashmap would grow uncontrollably without maintenance. It's not a cache so I don't think an LRU-based reclamation makes sense. Instead, I've made use of the sharding functionality of DashMap to lock a single shard and clear out unused semaphores periodically. I ran a test in release, using 128 tokio tasks among 12 threads each pushing 1000 entries into the map per second, clearing a shard every 2 seconds (64 second epoch with 32 shards). The endpoint names were sampled from a gamma distribution to make sure some overlap would occur, and each permit was held for 1ms. The histogram for time to clear each shard settled between 256-512us without any variance in my testing. Holding a lock for under a millisecond for 1 of the shards does not concern me as blocking	2023-11-09 14:14:30 +00:00
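The shard-at-a-time cleanup described in #5799 can be sketched with the standard library alone. This is an approximation under stated assumptions, not the proxy's code: a `Vec<Mutex<HashMap>>` stands in for `DashMap`, the empty `Permits` struct stands in for a semaphore, and an entry counts as unused when nothing outside the map holds a reference to it. All names here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

/// Stand-in for a per-endpoint semaphore (assumption for this sketch).
pub struct Permits;

/// A sharded map from endpoint name to its limiter.
pub struct ShardedLimiters {
    shards: Vec<Mutex<HashMap<String, Arc<Permits>>>>,
}

impl ShardedLimiters {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    fn shard_of(&self, endpoint: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        endpoint.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Fetches (or creates) the limiter for an endpoint, locking one shard only.
    pub fn get(&self, endpoint: &str) -> Arc<Permits> {
        let mut shard = self.shards[self.shard_of(endpoint)].lock().unwrap();
        shard
            .entry(endpoint.to_owned())
            .or_insert_with(|| Arc::new(Permits))
            .clone()
    }

    /// Locks a single shard and drops entries nobody else references.
    pub fn gc_shard(&self, index: usize) {
        let mut shard = self.shards[index % self.shards.len()].lock().unwrap();
        shard.retain(|_, limiter| Arc::strong_count(limiter) > 1);
    }

    /// Total number of tracked endpoints across all shards.
    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}
```

Calling `gc_shard` on a rotating index from a periodic task means each sweep blocks only the endpoints that hash to that one shard, which is the property the commit message relies on.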
Andrew Rudenko	fc47af156f	Passing neon options to the console (#5781) The idea is to pass neon_* prefixed options to the control plane. They can be used by cplane to dynamically create timelines and computes. Such options should also be excluded from what is passed to compute. Another issue is how connection caching works now: because the compute instance now depends not only on the hostname but probably on such options too, I included them in the cache key.	2023-11-07 16:49:26 +01:00
Joonas Koivunen	4be6bc7251	refactor: remove unnecessary unsafe (#5802 ) unsafe impls for `Send` and `Sync` should not be added by default. in the case of `SlotGuard` removing them does not cause any issues, as the compiler automatically derives those. This PR adds requirement to document the unsafety (see [clippy::undocumented_unsafe_blocks]) and opportunistically adds `#![deny(unsafe_code)]` to most places where we don't have unsafe code right now. TRPL on Send and Sync: https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html [clippy::undocumented_unsafe_blocks]: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks	2023-11-07 10:26:25 +00:00
Conrad Ludgate	ad5b02e175	proxy: remove unsafe (#5805 ) ## Problem `unsafe {}` ## Summary of changes `CStr` has a method to parse the bytes up to a null byte, so we don't have to do it ourselves.	2023-11-06 17:44:44 +00:00
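The `CStr` method referred to in #5805 is presumably `CStr::from_bytes_until_nul` from the standard library; the commit message does not name it, so treat that as an assumption. A minimal sketch of using it without `unsafe`, with `read_cstr` as a hypothetical helper name:

```rust
use std::ffi::CStr;

/// Returns the text before the first NUL byte in `bytes`.
/// Yields `None` if there is no NUL byte or the prefix is not valid UTF-8.
pub fn read_cstr(bytes: &[u8]) -> Option<&str> {
    CStr::from_bytes_until_nul(bytes).ok()?.to_str().ok()
}
```

Bytes after the first NUL are ignored, so a buffer holding several NUL-terminated fields can be read one field at a time by slicing past each terminator.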
Conrad Ludgate	cdcaa329bf	proxy: no more statements (#5747 ) ## Problem my prepared statements change in tokio-postgres landed in the latest release. it didn't work as we intended ## Summary of changes https://github.com/neondatabase/rust-postgres/pull/24	2023-11-03 08:30:58 +00:00
khanova	4c7fa12a2a	Proxy introduce allowed ips (#5729 ) ## Problem Proxy doesn't accept wake_compute responses with the allowed IPs. ## Summary of changes Extend wake_compute api to be able to return allowed_ips.	2023-11-02 16:26:15 +00:00
Muhammet Yazici	4f0a8e92ad	fix: Add bearer prefix to Authorization header (#5740 ) ## Problem Some requests with `Authorization` header did not properly set the `Bearer ` prefix. Problem explained here https://github.com/neondatabase/cloud/issues/6390. ## Summary of changes Added `Bearer ` prefix to missing requests.	2023-11-01 09:41:48 +03:00
Conrad Ludgate	d8c21ec70d	fix nightly 1.75 (#5719 ) ## Problem Neon doesn't compile on nightly and had numerous clippy complaints. ## Summary of changes 1. Fixed troublesome dependency 2. Fixed or ignored the lints where appropriate	2023-10-30 16:43:06 +00:00
Conrad Ludgate	964c5c56b7	proxy: dont retry server errors (#5694) ## Problem accidental spam ## Summary of changes don't spam control plane if control plane is down :)	2023-10-30 08:38:56 +00:00
Arpad Müller	bd59349af3	Fix Rust 1.74 warnings (#5702 ) Fixes new warnings and clippy changes introduced by version 1.74 of the rust compiler toolchain.	2023-10-28 03:47:26 +02:00
Conrad Ludgate	493b47e1da	proxy: exclude client latencies in metrics (#5688 ) ## Problem In #5539, I moved the connect_to_compute latency to start counting before authentication - this is because authentication will perform some calls to the control plane in order to get credentials and to eagerly wake a compute server. It felt important to include these times in the latency metric as these are times we should definitely care about reducing. What is not interesting to record in this metric is the roundtrip time during authentication when we wait for the client to respond. ## Summary of changes Implement a mechanism to pause the latency timer, resuming on drop of the pause struct. We pause the timer right before we send the authentication message to the client, and we resume the timer right after we complete the authentication flow.	2023-10-27 17:17:39 +00:00
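The pause-and-resume-on-drop mechanism described in #5688 might look roughly like the following. `LatencyTimer`, `Pause`, and `active` are hypothetical names for this sketch, and the real proxy type may track more state (labels, outcome, and so on).

```rust
use std::time::{Duration, Instant};

/// Measures elapsed time, excluding any paused periods.
pub struct LatencyTimer {
    start: Instant,
    paused: Duration,
}

/// Guard returned by `pause`; the timer resumes when it is dropped.
pub struct Pause<'a> {
    timer: &'a mut LatencyTimer,
    began: Instant,
}

impl LatencyTimer {
    pub fn new() -> Self {
        Self {
            start: Instant::now(),
            paused: Duration::ZERO,
        }
    }

    /// Stops the clock until the returned guard is dropped.
    pub fn pause(&mut self) -> Pause<'_> {
        Pause {
            timer: self,
            began: Instant::now(),
        }
    }

    /// Time since creation, minus time spent paused.
    pub fn active(&self) -> Duration {
        self.start.elapsed().saturating_sub(self.paused)
    }
}

impl Drop for Pause<'_> {
    fn drop(&mut self) {
        // Resuming just means recording how long the pause lasted.
        self.timer.paused += self.began.elapsed();
    }
}
```

In the flow the commit describes, the guard would be created right before the authentication message is sent to the client and dropped once the authentication flow completes, so the client round trip is excluded from the recorded latency.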
duguorong009	39f8fd6945	feat: add `build_tag` env support for `set_build_info_metric` (#5576) - Add a new util `project_build_tag` macro, similar to `project_git_version` - Update `set_build_info_metric` to accept and make use of `build_tag` info - Update all code that uses `set_build_info_metric`	2023-10-27 10:47:11 +01:00
Conrad Ludgate	71611f4ab3	proxy: prepare to remove high cardinality metrics (#5461 ) ## Problem High cardinality metrics are bad ## Summary of changes Preparing to remove high cardinality metrics. Will actually remove in #5466	2023-10-26 22:54:37 +01:00
Conrad Ludgate	32126d705b	proxy refactor serverless (#4685 ) ## Problem Our serverless backend was a bit jumbled. As a comment indicated, we were handling SQL-over-HTTP in our `websocket.rs` file. I've extracted out the `sql_over_http` and `websocket` files from the `http` module and put them into a new module called `serverless`. ## Summary of changes ```sh mkdir proxy/src/serverless mv proxy/src/http/{conn_pool,sql_over_http,websocket}.rs proxy/src/serverless/ mv proxy/src/http/server.rs proxy/src/http/health_server.rs mv proxy/src/metrics proxy/src/usage_metrics.rs ``` I have also extracted the hyper server and handler from websocket.rs into `serverless.rs`	2023-10-25 15:43:03 +01:00
Conrad Ludgate	a461c459d8	fix http pool test (#5653) ## Problem We defer the returning of connections to the connection pool. It's possible for our test to be faster than the returning of connections - which then gets a differing process ID because it opens a new connection. ## Summary of changes 1. Delay the tests just a little (20ms) to give more chance for connections to return. 2. Correlate connection IDs with the connection logs a bit more	2023-10-25 13:20:45 +01:00
Conrad Ludgate	b2c96047d0	move wake compute after the auth quirks logic (#5642 ) ## Problem https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606 ## Summary of changes Make the auth_quirks_creds return the authentication information, and push the wake_compute loop to after, inside `auth_quirks`	2023-10-25 08:30:47 +01:00
Conrad Ludgate	767ef29390	proxy: filter out more quota exceeded errors (#5640 ) ## Problem Looking at logs, I saw more retries being performed for other quota exceeded errors ## Summary of changes Filter out all quota exceeded family of errors	2023-10-24 13:13:23 +01:00
Conrad Ludgate	94b4e76e13	proxy: latency connect outcome (#5588 ) ## Problem I recently updated the latency timers to include cache miss and pool miss, as well as connection protocol. By moving the latency timer to start before authentication, we count a lot more failures and it's messed up the latency dashboard. ## Summary of changes Add another label to LatencyTimer metrics for outcome. Explicitly report on success	2023-10-23 15:17:28 +01:00
khanova	b514da90cb	Set up timeout for scram protocol execution (#5551) ## Problem Context: https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679 Some of our scram protocol executions timed out only after 17 minutes. ## Summary of changes Make timeout for scram execution meaningful and configurable.	2023-10-23 15:11:05 +01:00
Conrad Ludgate	7d17f1719f	reduce cancel map contention (#5555 ) ## Problem Every database request locks this cancel map rwlock. At high requests per second this would have high contention ## Summary of changes Switch to dashmap which has a sharded rwlock to reduce contention	2023-10-23 14:12:41 +01:00
Conrad Ludgate	543b8153c6	proxy: add flag to reject requests without proxy protocol client ip (#5417 ) ## Problem We need a flag to require proxy protocol (prerequisite for #5416) ## Summary of changes Add a cli flag to require client IP addresses. Error if IP address is missing when the flag is active.	2023-10-17 16:59:35 +01:00
Conrad Ludgate	f775928dfc	proxy: refactor how and when connections are returned to the pool (#5095 ) ## Problem Transactions break connections in the pool fixes #4698 ## Summary of changes * Pool `Client`s are smart object that return themselves to the pool * Pool `Client`s can be 'discard'ed * Pool `Client`s are discarded when certain errors are encountered. * Pool `Client`s are discarded when ReadyForQuery returns a non-idle state.	2023-10-17 13:55:52 +00:00
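The "smart object" behaviour listed in #5095 can be sketched as a guard that hands its connection back on drop unless it was discarded. This is a simplified, synchronous illustration with hypothetical names (`Pool`, `Client`, `get`, `discard`); the proxy's pool is asynchronous and keyed by endpoint, which is not modelled here.

```rust
use std::sync::{Arc, Mutex};

/// A minimal pool of idle connections of type `T`.
pub struct Pool<T> {
    idle: Mutex<Vec<T>>,
}

/// A checked-out connection that returns itself to the pool on drop.
pub struct Client<T> {
    conn: Option<T>,
    pool: Arc<Pool<T>>,
    discarded: bool,
}

impl<T> Pool<T> {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            idle: Mutex::new(Vec::new()),
        })
    }

    /// Reuses an idle connection if one exists, otherwise calls `connect`.
    pub fn get(self: &Arc<Self>, connect: impl FnOnce() -> T) -> Client<T> {
        let reused = self.idle.lock().unwrap().pop();
        Client {
            conn: Some(reused.unwrap_or_else(connect)),
            pool: Arc::clone(self),
            discarded: false,
        }
    }

    pub fn idle_count(&self) -> usize {
        self.idle.lock().unwrap().len()
    }
}

impl<T> Client<T> {
    /// Marks the connection as unusable, e.g. after an error or when the
    /// session is left in a non-idle transaction state.
    pub fn discard(&mut self) {
        self.discarded = true;
    }

    pub fn conn(&self) -> &T {
        self.conn.as_ref().expect("connection is present until drop")
    }
}

impl<T> Drop for Client<T> {
    fn drop(&mut self) {
        if let Some(conn) = self.conn.take() {
            if !self.discarded {
                self.pool.idle.lock().unwrap().push(conn);
            }
        }
    }
}
```

Putting the return logic in `Drop` means callers cannot forget to hand a connection back, and an explicit `discard` is the only way to keep a broken connection out of the pool.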
Arpad Müller	093f8c5f45	Update rust to 1.73.0 (#5574 ) [Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)	2023-10-17 13:13:12 +01:00
Conrad Ludgate	8c522ea034	proxy: count cache-miss for compute latency (#5539 ) ## Problem Would be good to view latency for hot-path vs cold-path ## Summary of changes add some labels to latency metrics	2023-10-16 16:31:04 +01:00
khanova	21deb81acb	Fix case for array of jsons (#5523 ) ## Problem Currently proxy doesn't handle array of json parameters correctly. ## Summary of changes Added one more level of quotes escaping for the array of jsons case. Resolves: https://github.com/neondatabase/neon/issues/5515	2023-10-12 14:32:49 +02:00
khanova	dbb21d6592	Make http timeout configurable (#5532 ) ## Problem Currently http timeout is hardcoded to 15 seconds. ## Summary of changes Added an option to configure it via cli args. Context: https://neondb.slack.com/archives/C04DGM6SMTM/p1696941726151899	2023-10-12 11:41:07 +02:00
Conrad Ludgate	d4dc86f8e3	proxy: more connection metrics (#5464 ) ## Problem Hard to tell 1. How many clients are connected to proxy 2. How many requests clients are making 3. How many connections are made to a database 1 and 2 are different because of the properties of HTTP. We have 2 already tracked through `proxy_accepted_connections_total` and `proxy_closed_connections_total`, but nothing for 1 and 3 ## Summary of changes Adds 2 new counter gauges. * `proxy_opened_client_connections_total`,`proxy_closed_client_connections_total` - how many client connections are open to proxy * `proxy_opened_db_connections_total`,`proxy_closed_db_connections_total` - how many active connections are made through to a database. For TCP and Websockets, we expect all 3 of these quantities to be roughly the same, barring users connecting but with invalid details. For HTTP: * client_connections/connections can differ because the client connections can be reused. * connections/db_connections can differ because of connection pooling.	2023-10-10 16:33:20 +01:00
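The opened/closed counter pairs in #5464 imply that the number of currently open connections is the difference between the two counters. A minimal sketch of that idea with standard atomics follows; `CounterPair` and `ConnGuard` are hypothetical names, and the real metrics are presumably exported through a metrics library rather than read directly like this.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Two monotonically increasing counters; their difference acts as a gauge.
pub struct CounterPair {
    opened: AtomicU64,
    closed: AtomicU64,
}

/// Held for the lifetime of a connection; bumps `closed` on drop.
pub struct ConnGuard<'a> {
    pair: &'a CounterPair,
}

impl CounterPair {
    pub const fn new() -> Self {
        Self {
            opened: AtomicU64::new(0),
            closed: AtomicU64::new(0),
        }
    }

    /// Records an opened connection and returns its guard.
    pub fn open(&self) -> ConnGuard<'_> {
        self.opened.fetch_add(1, Ordering::Relaxed);
        ConnGuard { pair: self }
    }

    pub fn opened_total(&self) -> u64 {
        self.opened.load(Ordering::Relaxed)
    }

    /// Connections currently open: opened minus closed.
    pub fn active(&self) -> u64 {
        // Read `closed` first so a concurrent open/close cannot make it exceed `opened`.
        let closed = self.closed.load(Ordering::Relaxed);
        self.opened.load(Ordering::Relaxed).saturating_sub(closed)
    }
}

impl Drop for ConnGuard<'_> {
    fn drop(&mut self) {
        self.pair.closed.fetch_add(1, Ordering::Relaxed);
    }
}
```

Keeping two counters instead of one up/down gauge preserves the total number of connections ever opened, so rates can still be derived from the same pair.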
Alex Chi Z	5158de70f3	proxy: breakdown wake up failure metrics (#4933) ## Problem close https://github.com/neondatabase/neon/issues/4702 ## Summary of changes This PR adds a new metric for wake up errors and breaks it down by the most common reasons (mostly follows the `could_retry` implementation).	2023-10-10 13:17:37 +01:00
khanova	aec9188d36	Added timeout for http requests (#5514) ## Problem Proxy timeout for HTTP requests ## Summary of changes If an HTTP request exceeds 15s, it is killed. Resolves: https://github.com/neondatabase/neon/issues/4847	2023-10-10 13:39:38 +02:00

286 Commits