rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Ruslan Talpa	0dbe551802	proxy: subzero integration in auth-broker (embedded data-api) (#12474 ) ## Problem We want to have the data-api served by the proxy directly instead of relying on a 3rd party to run a deployment for each project/endpoint. ## Summary of changes With the changes below, the proxy (auth-broker) becomes also a "rest-broker", that can be thought of as a "Multi-tenant" data-api which provides an automated REST api for all the databases in the region. The core of the implementation (that leverages the subzero library) is in proxy/src/serverless/rest.rs and this is the only place that has "new logic". --------- Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-07-21 18:16:28 +00:00
Krzysztof Szafrański	e2982ed3ec	[proxy] Cache node info only for TTL, even if Redis is available (#12626 ) This PR simplifies our node info cache. Now we'll store entries for at most the TTL duration, even if Redis notifications are available. This will allow us to cache intermittent errors later (e.g. due to rate limits) with more predictable behavior. Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-16 16:23:05 +00:00
Folke Behrens	380d167b7c	proxy: For cancellation data replace HSET+EXPIRE/HGET with SET..EX/GET (#12553 ) ## Problem To store cancellation data we send two commands to redis because the redis server version doesn't support HSET with EX. Also, HSET is not really needed. ## Summary of changes * Replace the HSET + EXPIRE command pair with one SET .. EX command. * Replace HGET with GET. * Leave a workaround for old keys set with HSET. * Replace some anyhow errors with specific errors to surface the WRONGTYPE error from redis.	2025-07-11 19:35:42 +00:00
Conrad Ludgate	03e604e432	Nightly lints and small tweaks (#12456 ) Let chains available in 1.88 :D new clippy lints coming up in future releases.	2025-07-03 14:47:12 +00:00
Conrad Ludgate	1bc1eae5e8	fix redis credentials check (#12455 ) ## Problem `keep_connection` does not exit, so it was never setting `credentials_refreshed`. ## Summary of changes Set `credentials_refreshed` to true when we first establish a connection, and after we re-authenticate the connection.	2025-07-03 09:51:35 +00:00
Ivan Efremov	0f879a2e8f	[proxy]: Fix redis IRSA expiration failure errors (#12430 ) Relates to the [#30688](https://github.com/neondatabase/cloud/issues/30688)	2025-07-02 08:55:44 +00:00
Conrad Ludgate	fd1e8ec257	[proxy] review and cleanup CLI args (#12167 ) I was looking at how we could expose our proxy config as toml again, and as I was writing out the schema format, I noticed some cruft in our CLI args that no longer seem to be in use. The redis change is the most complex, but I am pretty sure it's sound. Since https://github.com/neondatabase/cloud/pull/15613 cplane no longer publishes to the global redis instance.	2025-06-26 11:25:41 +00:00
Conrad Ludgate	517a3d0d86	[proxy]: BatchQueue::call is not cancel safe - make it directly cancellation aware (#12345 ) ## Problem https://github.com/neondatabase/cloud/issues/30539 If the current leader cancels the `call` function, then it has removed the jobs from the queue, but will never finish sending the responses. Because of this, it is not cancellation safe. ## Summary of changes Document these functions as not cancellation safe. Move cancellation of the queued jobs into the queue itself. ## Alternatives considered 1. We could spawn the task that runs the batch, since that won't get cancelled. * This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`. 2. We could add another scopeguard and return the requests back to the queue. * This requires that requests are always retry safe, and also requires requests to be `Clone`.	2025-06-25 14:19:20 +00:00
Conrad Ludgate	a298d2c29b	[proxy] replace the batch cancellation queue, shorten the TTL for cancel keys (#11943 ) See #11942 Idea: * if connections are short lived, they can get enqueued and then also remove themselves later if they never made it to redis. This reduces the load on the queue. * short lived connections (<10m, most?) will only issue 1 command, we remove the delete command and rely on ttl. * we can enqueue as many commands as we want, as we can always cancel the enqueue, thanks to the ~~intrusive linked lists~~ `BTreeMap`.	2025-06-20 11:48:01 +00:00
Conrad Ludgate	6dd84041a1	refactor and simplify the invalidation notification structure (#12154 ) The current cache invalidation messages are far too specific. They should be more generic since it only ends up triggering a `GetEndpointAccessControl` message anyway. Mappings: * `/allowed_ips_updated`, `/block_public_or_vpc_access_updated`, and `/allowed_vpc_endpoints_updated_for_projects` -> `/project_settings_update`. * `/allowed_vpc_endpoints_updated_for_org` -> `/account_settings_update`. * `/password_updated` -> `/role_setting_update`. I've also introduced `/endpoint_settings_update`. All message types support singular or multiple entries, which allows us to simplify things both on our side and on cplane side. I'm opening a PR to cplane to apply the above mappings, but for now using the old phrases to allow both to roll out independently. This change is inspired by my need to add yet another cached entry to `GetEndpointAccessControl` for https://github.com/neondatabase/cloud/issues/28333	2025-06-06 12:49:29 +00:00
Conrad Ludgate	589bfdfd02	proxy: Changes to rate limits and GetEndpointAccessControl caches. (#12048 ) Precursor to https://github.com/neondatabase/cloud/issues/28333. We want per-endpoint configuration for rate limits, which will be distributed via the `GetEndpointAccessControl` API. This lays some of the ground work. 1. Allow the endpoint rate limiter to accept a custom leaky bucket config on check. 2. Remove the unused auth rate limiter, as I don't want to think about how it fits into this. 3. Refactor the caching of `GetEndpointAccessControl`, as it adds friction for adding new cached data to the API. That third one was rather large. I couldn't find any way to split it up. The core idea is that there's now only 2 cache APIs. `get_endpoint_access_controls` and `get_role_access_controls`. I'm pretty sure the behaviour is unchanged, except I did a drive by change to fix #8989 because it felt harmless. The change in question is that when a password validation fails, we eagerly expire the role cache if the role was cached for 5 minutes. This is to allow for edge cases where a user tries to connect with a reset password, but the cache never expires the entry due to some redis related quirk (lag, or misconfiguration, or cplane error)	2025-06-02 08:38:35 +00:00
Conrad Ludgate	87179e26b3	completely rewrite pq_proto (#12085 ) libs/pqproto is designed for safekeeper/pageserver with maximum throughput. proxy only needs it for handshakes/authentication where throughput is not a concern but memory efficiency is. For this reason, we switch to using read_exact and only allocating as much memory as we need to. All reads return a `&'a [u8]` instead of a `Bytes` because accidental sharing of bytes can cause fragmentation. Returning the reference enforces all callers only hold onto the bytes they absolutely need. For example, before this change, `pqproto` was allocating 8KiB for the initial read `BytesMut`, and proxy was holding the `Bytes` in the `StartupMessageParams` for the entire connection through to passthrough.	2025-06-01 18:41:45 +00:00
Conrad Ludgate	a113c48c43	proxy: fix redis batching support (#11905 ) ## Problem For `StoreCancelKey`, we were inserting 2 commands, but we were not inserting two replies. This mismatch leads to errors when decoding the response. ## Summary of changes Abstract the command + reply pipeline so that commands and replies are registered at the same time.	2025-05-13 08:33:53 +00:00
Conrad Ludgate	946e971df8	feat(proxy): add batching to cancellation queue processing (#10607 ) Add batching to the redis queue, which allows us to clear it out quicker should it slow down temporarily.	2025-04-12 09:16:22 +00:00
Folke Behrens	1dbf40ee2c	proxy: Update redis crate (#11372 )	2025-03-28 11:43:52 +00:00
Arpad Müller	fdde58120c	Upgrade proxy crates to edition 2024 (#10942 ) This upgrades the `proxy/` crate as well as the forked libraries in `libs/proxy/` to edition 2024. Also reformats the imports of those forked libraries via: ``` cargo +nightly fmt -p proxy -p postgres-protocol2 -p postgres-types2 -p tokio-postgres2 -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true ``` It can be read commit-by-commit: the first commit has no formatting changes, only changes to accommodate the new edition. Part of #10918	2025-02-24 15:26:28 +00:00
Folke Behrens	da7496e1ee	proxy: Post-refactor + future clippy lint cleanup (#10824 ) * Clean up deps and code after logging and binary refactor * Also include future clippy lint cleanup	2025-02-14 12:34:09 +00:00
Folke Behrens	f62bc28086	proxy: Move binaries into the lib (#10758 ) * This way all clippy lints defined in the lib also cover the binary code. * It's much easier to detect unused code. * Fix all discovered lints.	2025-02-11 19:46:23 +00:00
Stefan Radig	6dd48ba148	feat(proxy): Implement access control with VPC endpoint checks and block for public internet / VPC (#10143 ) - Wired up filtering on VPC endpoints - Wired up block access from public internet / VPC depending on per project flag - Added cache invalidation for VPC endpoints (partially based on PR from Raphael) - Removed BackendIpAllowlist trait --------- Co-authored-by: Ivan Efremov <ivan@neon.tech>	2025-01-31 20:32:57 +00:00
Conrad Ludgate	738bf83583	chore: replace dashmap with clashmap (#10582 ) ## Problem Because dashmap 6 switched to hashbrown RawTable API, it required us to use unsafe code in the upgrade: https://github.com/neondatabase/neon/pull/8107 ## Summary of changes Switch to clashmap, a fork maintained by me which removes much of the unsafe and ultimately switches to HashTable instead of RawTable to remove much of the unsafe requirement on us.	2025-01-31 09:53:43 +00:00
Ivan Efremov	222cc181e9	impr(proxy): Move the CancelMap to Redis hashes (#10364 ) ## Problem The approach of having CancelMap as an in-memory structure increases code complexity, as well as putting additional load for Redis streams. ## Summary of changes - Implement a set of KV ops for Redis client; - Remove cancel notifications code; - Send KV ops over the bounded channel to the handling background task for removing and adding the cancel keys. Closes #9660	2025-01-29 11:19:10 +00:00
Folke Behrens	77660f3d88	proxy: Fix parsing of UnknownTopic with payload (#10339 ) ## Problem When the proxy receives a `Notification` with an unknown topic it's supposed to use the `UnknownTopic` unit variant. Unfortunately, in adjacently tagged enums serde will not simply ignore the configured content if found and try to deserialize a map/object instead. ## Summary of changes * Use a custom deserialize function to ignore variant content. * Add a little unit test covering both cases.	2025-01-10 09:12:31 +00:00
Folke Behrens	0a117fb1f1	proxy: Parse Notification twice only for unknown topic (#10296 ) ## Problem We currently parse Notification twice even in the happy path. ## Summary of changes Use `#[serde(other)]` to catch unknown topics and defer the second parsing.	2025-01-07 15:24:54 +00:00
Conrad Ludgate	fda52a0005	feat(proxy): dont trigger error alerts for unknown topics (#10266 ) ## Problem Before the holidays, and just before our code freeze, a change to cplane was made that started publishing the topics from #10197. This triggered our alerts and put us in a sticky situation as it was not an error, and we didn't want to silence the alert for the entire holidays, and we didn't want to release proxy 2 days in a row if it was not essential. We fixed it eventually by rewriting the alert based on logs, but this is not a good solution. ## Summary of changes Introduces an intermediate parsing step to check the topic name first, to allow us to ignore parsing errors for any topics we do not know about.	2025-01-06 13:05:35 +00:00
Raphael 'kena' Poss	1622fd8bda	proxy: recognize but ignore the 3 new redis message types (#10197 ) ## Problem https://neondb.slack.com/archives/C085MBDUSS2/p1734604792755369 ## Summary of changes Recognize and ignore the 3 new broadcast messages: - `/block_public_or_vpc_access_updated` - `/allowed_vpc_endpoints_updated_for_org` - `/allowed_vpc_endpoints_updated_for_projects`	2025-01-02 16:02:48 +00:00
Conrad Ludgate	38c7a2abfc	chore(proxy): pre-load native tls certificates and propagate compute client config (#10182 ) Now that we construct the TLS client config for cancellation as well as connect, it feels appropriate to construct the same config once and re-use it elsewhere. It might also help should #7500 require any extra setup, so we can easily add it to all the appropriate call sites.	2025-01-02 09:36:13 +00:00
Conrad Ludgate	59b7ff8988	chore(proxy): disallow unwrap and unimplemented (#10142 ) As the title says, I updated the lint rules to no longer allow unwrap or unimplemented. Three special cases: * Tests are allowed to use them * std::sync::Mutex lock().unwrap() is common because it's usually correct to continue panicking on poison * `tokio::spawn_blocking(...).await.unwrap()` is common because it will only error if the blocking fn panics, so continuing the panic is also correct I've introduced two extension traits to help with these last two, that are a bit more explicit so they don't need an expect message every time.	2024-12-16 16:37:15 +00:00
Conrad Ludgate	24d6587914	chore(proxy): refactor self-signed config (#10154 ) ## Problem While reviewing #10152 I found it tricky to actually determine whether the connection used `allow_self_signed_compute` or not. I've tried to remove this setting in the past: * https://github.com/neondatabase/neon/pull/7884 * https://github.com/neondatabase/neon/pull/7437 * https://github.com/neondatabase/cloud/pull/13702 But each time it seems it is used by e2e tests ## Summary of changes The `node_info.allow_self_signed_computes` is always initialised to false, and then sometimes inherits the proxy config value. There's no need for this to be in the node_info, so removing it and propagating it via `TcpMechanism` is simpler.	2024-12-16 11:15:25 +00:00
Ivan Efremov	34c1295594	[proxy] impr: Additional logging for cancellation queries (#10039 ) ## Problem Since cancellation tasks are spawned in the background, their logs sometimes miss context. https://neondb.slack.com/archives/C060N3SEF9D/p1733427801527419?thread_ts=1733419882.560159&cid=C060N3SEF9D ## Summary of changes Add `session_id` and change loglevel for cancellation queries	2024-12-10 10:14:28 +00:00
Folke Behrens	1b60571636	proxy: Create Elasticache credentials provider lazily (#9967 ) ## Problem The credentials provider tries to connect to AWS STS even when we use plain Redis connections. ## Summary of changes * Construct the CredentialsProvider only when needed ("irsa").	2024-12-02 15:38:12 +00:00
Ivan Efremov	8173dc600a	proxy: spawn cancellation checks in the background (#9918 ) ## Problem For cancellation, a connection is open during all the cancel checks. ## Summary of changes Spawn cancellation checks in the background, and close connection immediately. Use task_tracker for cancellation checks.	2024-11-28 06:32:22 +00:00
Conrad Ludgate	96a1b71c84	chore(proxy): discard request context span during passthrough (#9882 ) ## Problem The RequestContext::span shouldn't live for the entire postgres connection, only the handshake. ## Summary of changes * Slight refactor to the RequestContext to discard the span upon handshake completion. * Make sure the temporary future for the handshake is dropped (not bound to a variable) * Runs our nightly fmt script	2024-11-25 21:32:53 +00:00
Ivan Efremov	3b1ac8b14a	proxy: Implement cancellation rate limiting (#9739 ) Implement cancellation rate limiting and ip allowlist checks. Add ip_allowlist to the cancel closure Fixes [#16456](https://github.com/neondatabase/cloud/issues/16456)	2024-11-22 16:46:38 +00:00
Conrad Ludgate	f36f0068b8	chore(proxy): demote more logs during successful connection attempts (#9828 ) Follow up to #9803 See https://github.com/neondatabase/cloud/issues/14378 In collaboration with @cloneable and @awarus, we sifted through logs and simply demoted some logs to debug. This is not at all finished and there are more logs to review, but we ran out of time in the session we organised. In any slightly more nuanced cases, we didn't touch the log, instead leaving a TODO comment. I've also slightly refactored the sql-over-http body read/length reject code. I can split that into a separate PR. It just felt natural after I switched to `read_body_with_limit` as we discussed during the meet.	2024-11-20 17:50:39 +00:00
Folke Behrens	92d5e0e87a	proxy: clear lib.rs of code items (#9479 ) We keep lib.rs for crate configs, lint configs and re-exports for the binaries.	2024-10-23 08:21:28 +02:00
Folke Behrens	f14e45f0ce	proxy: format imports with nightly rustfmt (#9414 ) ```shell cargo +nightly fmt -p proxy -- -l --config imports_granularity=Module,group_imports=StdExternalCrate,reorder_imports=true ``` These rust-analyzer settings for VSCode should help retain this style: ```json "rust-analyzer.imports.group.enable": true, "rust-analyzer.imports.prefix": "crate", "rust-analyzer.imports.merge.glob": false, "rust-analyzer.imports.granularity.group": "module", "rust-analyzer.imports.granularity.enforce": true, ```	2024-10-16 15:01:56 +02:00
Folke Behrens	6baf1aae33	proxy: Demote some errors to warnings in logs (#9354 )	2024-10-11 11:29:08 +02:00
Folke Behrens	ad267d849f	proxy: Move module base files into module directory (#9297 )	2024-10-07 16:25:34 +02:00
Folke Behrens	bae793ffcd	proxy: Handle all let underscore instances (#8898 ) * Most can be simply replaced * One instance renamed to _rtchk (return-type check)	2024-09-10 15:36:08 +02:00
Conrad Ludgate	12850dd5e9	proxy: remove dead code (#8847 ) By marking everything possible as pub(crate), we find a few dead code candidates.	2024-08-27 12:00:35 +01:00
Folke Behrens	d6eede515a	proxy: clippy lints: handle some low hanging fruit (#8829 ) Should be mostly uncontroversial ones.	2024-08-26 15:16:54 +02:00
Folke Behrens	f246aa3ca7	proxy: Fix some warnings by extended clippy checks (#8748 ) * Missing blank lifetimes which is now deprecated. * Matching off unqualified enum variants that could act like variable. * Missing semicolons.	2024-08-19 10:33:46 +02:00
Conrad Ludgate	411a130675	Fix nightly warnings 2024 june (#8151 ) ## Problem new clippy warnings on nightly. ## Summary of changes broken up each commit by warning type. 1. Remove some unnecessary refs. 2. In edition 2024, inference will default to `!` and not `()`. 3. Clippy complains about doc comment indentation 4. Fix `Trait + ?Sized` where `Trait: Sized`. 5. diesel_derives triggering `non_local_definitions`	2024-07-12 13:58:04 +01:00
Anna Khanova	240efb82f9	Proxy reconnect pubsub before expiration (#7562 ) ## Problem Proxy reconnects to redis only after it's already unavailable. ## Summary of changes Reconnects every 6h.	2024-05-03 10:00:29 +02:00
Anna Khanova	25af32e834	proxy: keep track on the number of events from redis by type. (#7582 ) ## Problem It's unclear what is the distribution of messages, proxy is consuming from redis. ## Summary of changes Add counter.	2024-05-02 09:50:11 +00:00
Anna Khanova	fd49005cb3	proxy: Improve logging (#7405 ) ## Problem It's unclear from logs what's going on with the regional redis. ## Summary of changes Make logs better.	2024-04-17 11:33:31 +00:00
Anna Khanova	40f15c3123	Read cplane events from regional redis (#7352 ) ## Problem Actually read redis events. ## Summary of changes This is revert of https://github.com/neondatabase/neon/pull/7350 + fixes. * Fixed events parsing * Added timeout after connection failure * Separated regional and global redis clients.	2024-04-11 18:24:34 +00:00
Conrad Ludgate	5299f917d6	proxy: replace prometheus with measured (#6717 ) ## Problem My benchmarks show that prometheus is not very good. https://github.com/conradludgate/measured We're already using it in storage_controller and it seems to be working well. ## Summary of changes Replace prometheus with my new measured crate in proxy only. Apologies for the large diff. I tried to keep it as minimal as I could. The label types add a bit of boiler plate (but reduce the chance we mistype the labels), and some of our custom metrics like CounterPair and HLL needed to be rewritten.	2024-04-11 16:26:01 +00:00
Anna Khanova	0bb04ebe19	Revert "Proxy read ids from redis (#7205 )" (#7350 ) This reverts commit `dbac2d2c47`. ## Problem Proxy pods fail to install in k8s clusters, cplane release blocking. ## Summary of changes Revert	2024-04-10 10:12:55 +00:00
Anna Khanova	dbac2d2c47	Proxy read ids from redis (#7205 ) ## Problem Proxy doesn't know about existing endpoints. ## Summary of changes * Added caching of all available endpoints. * On the high load, use it before going to cplane. * Report metrics for the outcome. * For rate limiter and credentials caching don't distinguish between `-pooled` and not TODOs: * Make metrics more meaningful * Consider integrating it with the endpoint rate limiter * Test it together with cplane in preview	2024-04-10 02:40:14 +02:00
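
Commit 59b7ff8988 (#10142) mentions extension traits that make "continue panicking on poison" explicit, so that `std::sync::Mutex` locking does not need `unwrap()` or an `expect` message at every call site. The following is only a minimal sketch of that idea using the standard library; the trait name `MutexExt`, the method `lock_propagate_poison`, and the helper `bump` are assumptions for illustration and may not match what the proxy crate actually defines.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical extension trait (the name is an assumption for illustration).
pub trait MutexExt<T> {
    /// Lock the mutex; if a previous holder panicked, keep propagating that panic.
    fn lock_propagate_poison(&self) -> MutexGuard<'_, T>;
}

impl<T> MutexExt<T> for Mutex<T> {
    fn lock_propagate_poison(&self) -> MutexGuard<'_, T> {
        match self.lock() {
            Ok(guard) => guard,
            // A poisoned lock means another thread panicked while holding it,
            // so the protected state may be inconsistent: panic here as well.
            Err(poison) => panic!("{poison}"),
        }
    }
}

/// Hypothetical caller: increments a shared counter and returns the new value.
pub fn bump(counter: &Mutex<u32>) -> u32 {
    let mut guard = counter.lock_propagate_poison();
    *guard += 1;
    *guard
}
```

With a helper like this, the `unwrap` lint can stay enabled crate-wide while the intent at each lock site is spelled out in the method name.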

58 Commits