rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-09 14:32:57 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	316309c85b	channel binding (#5683 ) ## Problem channel binding protects scram from sophisticated MITM attacks where the attacker is able to produce 'valid' TLS certificates. ## Summary of changes get the tls-server-end-point channel binding, and verify it is correct for the SCRAM-SHA-256-PLUS authentication flow	2023-11-27 21:45:15 +00:00
Conrad Ludgate	a56fd45f56	proxy: fix memory leak again (#5909 ) ## Problem The connections.join_next helped but it wasn't enough... The way I implemented the improvement before was still faulty but it mostly worked so it looked like it was working correctly. From [`tokio::select` docs](https://docs.rs/tokio/latest/tokio/macro.select.html): > 4. Once an <async expression> returns a value, attempt to apply the value to the provided <pattern>, if the pattern matches, evaluate <handler> and return. If the pattern does not match, disable the current branch and for the remainder of the current call to select!. Continue from step 3. The `connections.join_next()` future would complete and `Some(Err(e))` branch would be evaluated but not match (as the future would complete without panicking, we would hope). Since the branch doesn't match, it's disabled. The select continues but never attempts to call `join_next` again. Getting unlucky, more TCP connections are created than we attempt to join_next. ## Summary of changes Replace the `Some(Err(e))` pattern with `Some(e)`. Because of the auto-disabling feature, we don't need the `if !connections.is_empty()` step as the `None` pattern will disable it for us.	2023-11-23 19:11:24 +00:00
khanova	0c243faf96	Proxy log pid hack (#5869 ) ## Problem Improve observability for the compute node. ## Summary of changes Log pid from the compute node. Doesn't work with pgbouncer.	2023-11-16 20:46:23 +00:00
khanova	6b82f22ada	Collect number of connections by sni type (#5867 ) ## Problem We don't know the number of users with the different kind of authentication: ["sni", "endpoint in options" (A and B from [here](https://neon.tech/docs/connect/connection-errors)), "password_hack"] ## Summary of changes Collect metrics by sni kind.	2023-11-16 12:19:13 +00:00
khanova	2f0d245c2a	Proxy control plane rate limiter (#5785 ) ## Problem Proxy might overload the control plane. ## Summary of changes Implement rate limiter for proxy<->control plane connection. Resolves https://github.com/neondatabase/neon/issues/5707 Used implementation ideas from https://github.com/conradludgate/squeeze/	2023-11-15 09:15:59 +00:00
Conrad Ludgate	7cdde285a5	proxy: limit concurrent wake_compute requests per endpoint (#5799 ) ## Problem A user can perform many database connections at the same instant of time - these will all cache miss and materialise as requests to the control plane. #5705 ## Summary of changes I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints -> semaphores to apply a limiter. If the limiter is enabled (permits > 0), the semaphore will be retrieved per endpoint and a permit will be awaited before continuing to call the wake_compute endpoint. ### Important details This dashmap would grow uncontrollably without maintenance. It's not a cache so I don't think an LRU-based reclamation makes sense. Instead, I've made use of the sharding functionality of DashMap to lock a single shard and clear out unused semaphores periodically. I ran a test in release, using 128 tokio tasks among 12 threads each pushing 1000 entries into the map per second, clearing a shard every 2 seconds (64 second epoch with 32 shards). The endpoint names were sampled from a gamma distribution to make sure some overlap would occur, and each permit was held for 1ms. The histogram for time to clear each shard settled between 256-512us without any variance in my testing. Holding a lock for under a millisecond for 1 of the shards does not concern me as blocking	2023-11-09 14:14:30 +00:00
Andrew Rudenko	fc47af156f	Passing neon options to the console (#5781 ) The idea is to pass neon_* prefixed options to control plane. It can be used by cplane to dynamically create timelines and computes. Such options also should be excluded from passing to compute. Another issue is how connection caching is working now, because compute's instance now depends not only on hostname but probably on such options too I included them to cache key.	2023-11-07 16:49:26 +01:00
Joonas Koivunen	4be6bc7251	refactor: remove unnecessary unsafe (#5802 ) unsafe impls for `Send` and `Sync` should not be added by default. in the case of `SlotGuard` removing them does not cause any issues, as the compiler automatically derives those. This PR adds requirement to document the unsafety (see [clippy::undocumented_unsafe_blocks]) and opportunistically adds `#![deny(unsafe_code)]` to most places where we don't have unsafe code right now. TRPL on Send and Sync: https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html [clippy::undocumented_unsafe_blocks]: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks	2023-11-07 10:26:25 +00:00
Conrad Ludgate	ad5b02e175	proxy: remove unsafe (#5805 ) ## Problem `unsafe {}` ## Summary of changes `CStr` has a method to parse the bytes up to a null byte, so we don't have to do it ourselves.	2023-11-06 17:44:44 +00:00
Conrad Ludgate	cdcaa329bf	proxy: no more statements (#5747 ) ## Problem my prepared statements change in tokio-postgres landed in the latest release. it didn't work as we intended ## Summary of changes https://github.com/neondatabase/rust-postgres/pull/24	2023-11-03 08:30:58 +00:00
khanova	4c7fa12a2a	Proxy introduce allowed ips (#5729 ) ## Problem Proxy doesn't accept wake_compute responses with the allowed IPs. ## Summary of changes Extend wake_compute api to be able to return allowed_ips.	2023-11-02 16:26:15 +00:00
Muhammet Yazici	4f0a8e92ad	fix: Add bearer prefix to Authorization header (#5740 ) ## Problem Some requests with `Authorization` header did not properly set the `Bearer ` prefix. Problem explained here https://github.com/neondatabase/cloud/issues/6390. ## Summary of changes Added `Bearer ` prefix to missing requests.	2023-11-01 09:41:48 +03:00
Conrad Ludgate	d8c21ec70d	fix nightly 1.75 (#5719 ) ## Problem Neon doesn't compile on nightly and had numerous clippy complaints. ## Summary of changes 1. Fixed troublesome dependency 2. Fixed or ignored the lints where appropriate	2023-10-30 16:43:06 +00:00
Conrad Ludgate	964c5c56b7	proxy: dont retry server errors (#5694 ) ## Problem accidental spam ## Summary of changes don't spam control plane if control plane is down :) ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-10-30 08:38:56 +00:00
Arpad Müller	bd59349af3	Fix Rust 1.74 warnings (#5702 ) Fixes new warnings and clippy changes introduced by version 1.74 of the rust compiler toolchain.	2023-10-28 03:47:26 +02:00
Conrad Ludgate	493b47e1da	proxy: exclude client latencies in metrics (#5688 ) ## Problem In #5539, I moved the connect_to_compute latency to start counting before authentication - this is because authentication will perform some calls to the control plane in order to get credentials and to eagerly wake a compute server. It felt important to include these times in the latency metric as these are times we should definitely care about reducing. What is not interesting to record in this metric is the roundtrip time during authentication when we wait for the client to respond. ## Summary of changes Implement a mechanism to pause the latency timer, resuming on drop of the pause struct. We pause the timer right before we send the authentication message to the client, and we resume the timer right after we complete the authentication flow.	2023-10-27 17:17:39 +00:00
duguorong009	39f8fd6945	feat: add `build_tag` env support for `set_build_info_metric` (#5576 ) - Add a new util `project_build_tag` macro, similar to `project_git_version` - Update the `set_build_info_metric` to accept and make use of `build_tag` info - Update all codes which use the `set_build_info_metric`	2023-10-27 10:47:11 +01:00
Conrad Ludgate	71611f4ab3	proxy: prepare to remove high cardinality metrics (#5461 ) ## Problem High cardinality metrics are bad ## Summary of changes Preparing to remove high cardinality metrics. Will actually remove in #5466	2023-10-26 22:54:37 +01:00
Conrad Ludgate	32126d705b	proxy refactor serverless (#4685 ) ## Problem Our serverless backend was a bit jumbled. As a comment indicated, we were handling SQL-over-HTTP in our `websocket.rs` file. I've extracted out the `sql_over_http` and `websocket` files from the `http` module and put them into a new module called `serverless`. ## Summary of changes ```sh mkdir proxy/src/serverless mv proxy/src/http/{conn_pool,sql_over_http,websocket}.rs proxy/src/serverless/ mv proxy/src/http/server.rs proxy/src/http/health_server.rs mv proxy/src/metrics proxy/src/usage_metrics.rs ``` I have also extracted the hyper server and handler from websocket.rs into `serverless.rs`	2023-10-25 15:43:03 +01:00
Conrad Ludgate	a461c459d8	fix http pool test (#5653 ) ## Problem We defer the returning of connections the the connection pool. It's possible for our test to be faster than the returning of connections - which then gets a differing process ID because it opens a new connection. ## Summary of changes 1. Delay the tests just a little (20ms) to give more chance for connections to return. 2. Correlate connection IDs with the connection logs a bit more	2023-10-25 13:20:45 +01:00
Conrad Ludgate	b2c96047d0	move wake compute after the auth quirks logic (#5642 ) ## Problem https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606 ## Summary of changes Make the auth_quirks_creds return the authentication information, and push the wake_compute loop to after, inside `auth_quirks`	2023-10-25 08:30:47 +01:00
Conrad Ludgate	767ef29390	proxy: filter out more quota exceeded errors (#5640 ) ## Problem Looking at logs, I saw more retries being performed for other quota exceeded errors ## Summary of changes Filter out all quota exceeded family of errors	2023-10-24 13:13:23 +01:00
Conrad Ludgate	94b4e76e13	proxy: latency connect outcome (#5588 ) ## Problem I recently updated the latency timers to include cache miss and pool miss, as well as connection protocol. By moving the latency timer to start before authentication, we count a lot more failures and it's messed up the latency dashboard. ## Summary of changes Add another label to LatencyTimer metrics for outcome. Explicitly report on success	2023-10-23 15:17:28 +01:00
khanova	b514da90cb	Set up timeout for scram protocol execution (#5551 ) ## Problem Context: https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679 Some of out scram protocol execution timed out only after 17 minutes. ## Summary of changes Make timeout for scram execution meaningful and configurable.	2023-10-23 15:11:05 +01:00
Conrad Ludgate	7d17f1719f	reduce cancel map contention (#5555 ) ## Problem Every database request locks this cancel map rwlock. At high requests per second this would have high contention ## Summary of changes Switch to dashmap which has a sharded rwlock to reduce contention	2023-10-23 14:12:41 +01:00
Conrad Ludgate	543b8153c6	proxy: add flag to reject requests without proxy protocol client ip (#5417 ) ## Problem We need a flag to require proxy protocol (prerequisite for #5416) ## Summary of changes Add a cli flag to require client IP addresses. Error if IP address is missing when the flag is active.	2023-10-17 16:59:35 +01:00
Conrad Ludgate	f775928dfc	proxy: refactor how and when connections are returned to the pool (#5095 ) ## Problem Transactions break connections in the pool fixes #4698 ## Summary of changes * Pool `Client`s are smart object that return themselves to the pool * Pool `Client`s can be 'discard'ed * Pool `Client`s are discarded when certain errors are encountered. * Pool `Client`s are discarded when ReadyForQuery returns a non-idle state.	2023-10-17 13:55:52 +00:00
Arpad Müller	093f8c5f45	Update rust to 1.73.0 (#5574 ) [Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)	2023-10-17 13:13:12 +01:00
Conrad Ludgate	8c522ea034	proxy: count cache-miss for compute latency (#5539 ) ## Problem Would be good to view latency for hot-path vs cold-path ## Summary of changes add some labels to latency metrics	2023-10-16 16:31:04 +01:00
khanova	21deb81acb	Fix case for array of jsons (#5523 ) ## Problem Currently proxy doesn't handle array of json parameters correctly. ## Summary of changes Added one more level of quotes escaping for the array of jsons case. Resolves: https://github.com/neondatabase/neon/issues/5515	2023-10-12 14:32:49 +02:00
khanova	dbb21d6592	Make http timeout configurable (#5532 ) ## Problem Currently http timeout is hardcoded to 15 seconds. ## Summary of changes Added an option to configure it via cli args. Context: https://neondb.slack.com/archives/C04DGM6SMTM/p1696941726151899	2023-10-12 11:41:07 +02:00
Conrad Ludgate	d4dc86f8e3	proxy: more connection metrics (#5464 ) ## Problem Hard to tell 1. How many clients are connected to proxy 2. How many requests clients are making 3. How many connections are made to a database 1 and 2 are different because of the properties of HTTP. We have 2 already tracked through `proxy_accepted_connections_total` and `proxy_closed_connections_total`, but nothing for 1 and 3 ## Summary of changes Adds 2 new counter gauges. * `proxy_opened_client_connections_total`,`proxy_closed_client_connections_total` - how many client connections are open to proxy * `proxy_opened_db_connections_total`,`proxy_closed_db_connections_total` - how many active connections are made through to a database. For TCP and Websockets, we expect all 3 of these quantities to be roughly the same, barring users connecting but with invalid details. For HTTP: * client_connections/connections can differ because the client connections can be reused. * connections/db_connections can differ because of connection pooling.	2023-10-10 16:33:20 +01:00
Alex Chi Z	5158de70f3	proxy: breakdown wake up failure metrics (#4933 ) ## Problem close https://github.com/neondatabase/neon/issues/4702 ## Summary of changes This PR adds a new metrics for wake up errors and breaks it down by most common reasons (mostly follows the `could_retry` implementation).	2023-10-10 13:17:37 +01:00
khanova	aec9188d36	Added timeout for http requests (#5514 ) # Problem Proxy timeout for HTTP-requests ## Summary of changes If the HTTP-request exceeds 15s, it would be killed. Resolves: https://github.com/neondatabase/neon/issues/4847	2023-10-10 13:39:38 +02:00
Conrad Ludgate	bf065aabdf	proxy: update locked error retry filter (#5376 ) ## Problem We don't want to retry customer quota exhaustion errors. ## Summary of changes Make sure both types of quota exhaustion errors are not retried	2023-10-10 08:59:16 +01:00
Conrad Ludgate	c216b16b0f	proxy: fix memory leak (#5472 ) ## Problem these JoinSets live for the duration of the process. they might have many millions of connections spawned on them and they never get cleared. Fixes #4672 ## Summary of changes Drain the connections as we go	2023-10-05 07:30:28 +01:00
Conrad Ludgate	f002b1a219	proxy: http limits (#5460 ) ## Problem 1MB request body is apparently too small for some clients ## Summary of changes Update to 10 MB request body. Also revert the removal of response limits while we don't have streaming support.	2023-10-04 15:01:05 +01:00
Conrad Ludgate	fd20bbc6cb	proxy: log params when no endpoint (#5418 ) ## Problem Our SNI error dashboard features IP addresses but it's not immediately clear who that is still (#5369) ## Summary of changes Log some startup params with this error	2023-09-29 09:40:27 +01:00
Conrad Ludgate	528fb1bd81	proxy: metrics2 (#5179 ) ## Problem We need to count metrics always when a connection is open. Not only when the transfer is 0. We also need to count bytes usage for HTTP. ## Summary of changes New structure for usage metrics. A `DashMap<Ids, Arc<Counters>>`. If the arc has 1 owner (the map) then I can conclude that no connections are open. If the counters has "open_connections" non zero, then I can conclude a new connection was opened in the last interval and should be reported on. Also, keep count of how many bytes processed for HTTP and report it here.	2023-09-28 11:38:26 +01:00
George MacKerron	d8977d5199	Altered retry timing parameters for connect to compute, to get more and quicker retries (#5358 ) ## Problem Compute start time has improved, but the timing of connection retries from the proxy is rather slow, meaning we could be making clients wait hundreds of milliseconds longer than necessary. ## Summary of changes Previously, retry time in ms was `100 * 1.5*n`, and `n` starts at 1, giving: 150, 225, 337, 506, 759, 1139, 1709, ... This PR changes that to `25 sqrt(2)**(n - 1)` instead, giving: 25, 35, 50, 71, 100, 141, 200, ...	2023-09-25 12:27:41 +01:00
Joonas Koivunen	c5d226d9c7	refactor(consumption_metrics): prereq refactorings, tests (#5315 ) Split off from #5297. There should be no functional changes here: - refactor tenant metric "production" like previously timeline, allows unit testing, though not interesting enough yet to test - introduce type aliases for tuples - extra refactoring for `collect`, was initially thinking it was useful but will do a inline later - shorter binding names - support for future allocation reuse quests with IdempotencyKey - move code out of tokio::select to make it rustfmt-able - generification, allow later replacement of `&'static str` with enum - add tests that assert sent event contents exactly	2023-09-15 19:44:14 +03:00
Nikita Kalyanov	77658a155b	support deploying in IPv6-only environments (#4135 ) A set of changes to enable neon to work in IPv6 environments. The changes are backward-compatible but allow to deploy neon even to IPv6-only environments: - bind to both IPv4 and IPv6 interfaces - allow connections to Postgres from IPv6 interface - parse the address from control plane that could also be IPv6	2023-09-05 12:45:46 +03:00
Conrad Ludgate	44da9c38e0	proxy: error typo (#5187 ) ## Problem https://github.com/neondatabase/neon/pull/5162#discussion_r1311853491	2023-09-01 19:21:33 +03:00
Conrad Ludgate	1b916a105a	proxy: locked is not retriable (#5162 ) ## Problem Management service returns Locked when quotas are exhausted. We cannot retry on those ## Summary of changes Makes Locked status unretriable	2023-08-31 15:50:15 +03:00
Conrad Ludgate	d11621d904	Proxy: proxy protocol v2 (#5028 ) ## Problem We need to log the client IP, not the IP of the NLB. ## Summary of changes Parse the proxy [protocol version 2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) if possible	2023-08-31 14:30:25 +03:00
Nikita Kalyanov	b9c111962f	pass JWT to management API (#5151 ) support authentication with JWT from env for proxy calls to mgmt API	2023-08-31 12:23:51 +03:00
Conrad Ludgate	93dcdb293a	proxy: password hack hack (#5126 ) ## Problem fixes #4881 ## Summary of changes	2023-08-30 16:20:27 +01:00
Conrad Ludgate	3b81e0c86d	chore: remove webpki (#5069 ) ## Problem webpki is unmaintained Closes https://github.com/neondatabase/neon/security/dependabot/33 ## Summary of changes Update all dependents of webpki.	2023-08-30 15:14:03 +01:00
Conrad Ludgate	faf070f288	proxy: dont return connection pending (#5107 ) ## Problem We were returning Pending when a connection had a notice/notification (introduced recently in #5020). When returning pending, the runtime assumes you will call `cx.waker().wake()` in order to continue processing. We weren't doing that, so the connection task would get stuck ## Summary of changes Don't return pending. Loop instead	2023-08-25 15:08:45 +03:00
Conrad Ludgate	0b001a0001	proxy: remove connections on shutdown (#5051 ) ## Problem On shutdown, proxy connections are staying open. ## Summary of changes Remove the connections on shutdown	2023-08-21 19:20:58 +01:00

1 2 3 4 5 ...

270 Commits