rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	3778081b7b	eliminate needless reconnect	2025-07-28 12:04:00 +01:00
Conrad Ludgate	2ddbd3cc80	allow longer JWTs	2025-07-28 12:04:00 +01:00
Conrad Ludgate	5a293242f8	hack around the fact that the TLS cert is not a wildcard	2025-07-28 12:04:00 +01:00
Conrad Ludgate	ccea44becd	disable channel binding	2025-07-28 12:04:00 +01:00
Conrad Ludgate	2c915e2f3d	support mtls	2025-07-28 12:04:00 +01:00
Conrad Ludgate	2b22e0b069	use SNI for cancellation routing	2025-07-28 12:04:00 +01:00
Conrad Ludgate	739ecc6f6d	send compute IP address to regional ingress	2025-07-28 12:04:00 +01:00
Conrad Ludgate	725aed694b	do not replace cancelkeydata	2025-07-28 12:04:00 +01:00
Conrad Ludgate	d0e579c026	delay authenticationok until after connect_to_compute	2025-07-28 12:04:00 +01:00
Conrad Ludgate	56cc55d24a	do not validate passwords, just forward them on to postgres; don't get access controls; add a new cleartext authsecret instead	2025-07-28 12:03:58 +01:00
Conrad Ludgate	da6419a45a	expose lakebase-v1 as a flag	2025-07-28 11:59:27 +01:00
Conrad Ludgate	e5f5c79eb1	add lakebase_v1 cplane impl	2025-07-28 11:59:27 +01:00
Conrad Ludgate	314babb0cb	add ignored lints for the sake of the diff	2025-07-28 11:59:27 +01:00
Conrad Ludgate	effd6bf829	[proxy] add metrics for caches (#12752 ) Exposes metrics for caches. LKB-2594 This exposes a high level namespace, `cache`, that all cache metrics can be added to - this makes it easier to make library panels for the caches as I understand it. To calculate the current cache fill ratio, you could use the following query: ``` ( cache_inserted_total{cache="node_info"} - sum (cache_evicted_total{cache="node_info"}) without (cause) ) / cache_capacity{cache="node_info"} ``` To calculate the cache hit ratio, you could use the following query: ``` cache_request_total{cache="node_info", outcome="hit"} / sum (cache_request_total{cache="node_info"}) without (outcome) ```	2025-07-28 10:41:49 +00:00
Folke Behrens	25718e324a	proxy: Define service_info metric showing the run state (#12749 ) ## Problem Monitoring dashboards show aggregates of all proxy instances, including terminating ones. This can skew the results or make graphs less readable. Also, alerts must be tuned to ignore certain signals from terminating proxies. ## Summary of changes Add a `service_info` metric currently with one label, `state`, showing if an instance is in state `init`, `running`, or `terminating`. The metric can be joined with other metrics to filter the presented time series.	2025-07-25 18:27:21 +00:00
Conrad Ludgate	d09664f039	[proxy] replace TimedLru with moka (#12726 ) LKB-2536 TimedLru is hard to maintain. Let's use moka instead. Stacked on top of #12710.	2025-07-25 17:39:48 +00:00
Conrad Ludgate	d19aebcf12	[proxy] introduce moka for the project-info cache (#12710 ) ## Problem LKB-2502 The garbage collection of the project info cache is garbage. What we observed: If we get unlucky, we might throw away a very hot entry if the cache is full. The GC loop is dependent on getting a lucky shard of the projects2ep table that clears a lot of cold entries. The GC does not take into account active use, and the interval it runs at is too sparse to do any good. Can we switch to a proper cache implementation? Complications: 1. We need to invalidate by project/account. 2. We need to expire based on `retry_delay_ms`. ## Summary of changes 1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when deserializing. 2. Split the EndpointControls from the RoleControls into two different caches. 3. Introduce an expiry policy based on error retry info. 4. Introduce `moka` as a dependency, replacing our `TimedLru`. See the follow up PR for changing all TimedLru instances to use moka: #12726.	2025-07-25 11:40:47 +00:00
Conrad Ludgate	a70a5bccff	move subzero_core to proxy libs (#12742 ) We have a dedicated libs folder for proxy related libraries. Let's move the subzero_core stub there.	2025-07-25 10:44:28 +00:00
Conrad Ludgate	8daebb6ed4	[proxy] remove TokioMechanism and HyperMechanism (#12672 ) Another go at #12341. LKB-2497 We now only need 1 connect mechanism (and 1 more for testing) which saves us some code and complexity. We should be able to remove the final connect mechanism when we create a separate worker task for pglb->compute connections - either via QUIC streams or via in-memory channels. This also now ensures that connect_once always returns a ConnectionError type - something simple enough we can probably define a serialisation for in pglb. * I've abstracted connect_to_compute to always use TcpMechanism and the ProxyConfig. * I've abstracted connect_to_compute_and_auth to perform authentication, managing any retries for stale computes * I had to introduce a separate `managed` function for taking ownership of the compute connection into the Client/Connection pair	2025-07-24 12:37:04 +00:00
Conrad Ludgate	a695713727	[sql-over-http] Reset session state between pooled connection re-use (#12681 ) Session variables can be set during one sql-over-http query and observed on another when that pooled connection is re-used. To address this we can use `RESET ALL;` before re-using the connection. LKB-2495 To be on the safe side, we can opt for a full `DISCARD ALL;`, but that might have performance regressions since it also clears any query plans. See pgbouncer docs https://www.pgbouncer.org/config.html#server_reset_query. `DISCARD ALL` is currently defined as: ``` CLOSE ALL; SET SESSION AUTHORIZATION DEFAULT; RESET ALL; DEALLOCATE ALL; UNLISTEN *; SELECT pg_advisory_unlock_all(); DISCARD PLANS; DISCARD TEMP; DISCARD SEQUENCES; ``` I've opted to keep everything here except the `DISCARD PLANS`. I've modified the code so that this query is executed in the background when a connection is returned to the pool, rather than when taken from the pool. This should marginally improve performance for Neon RLS by removing 1 (localhost) round trip. I don't believe that keeping query plans could be a security concern. It's a potential side channel, but I can't imagine what you could extract from it. --- Thanks to https://github.com/neondatabase/neon/pull/12659#discussion_r2219016205 for probing the idea in my head.	2025-07-23 17:43:43 +00:00
Conrad Ludgate	761e9e0e1d	[proxy] move `read_info` from the compute connection to be as late as possible (#12660 ) Second attempt at #12130, now with a smaller diff. This allows us to skip allocating for things like parameter status and notices that we will either just forward untouched, or discard. LKB-2494	2025-07-23 13:33:21 +00:00
Folke Behrens	108f7ec544	Bump opentelemetry crates to 0.30 (#12680 ) This rebuilds #11552 on top of the current Cargo.lock. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2025-07-22 16:05:35 +00:00
Folke Behrens	9c0efba91e	Bump rand crate to 0.9 (#12674 )	2025-07-22 09:31:39 +00:00
Ivan Efremov	050c9f704f	proxy: expose session_id to clients and proxy latency to probes (#12656 ) Implements #8728	2025-07-21 20:27:15 +00:00
Ruslan Talpa	0dbe551802	proxy: subzero integration in auth-broker (embedded data-api) (#12474 ) ## Problem We want to have the data-api served by the proxy directly instead of relying on a 3rd party to run a deployment for each project/endpoint. ## Summary of changes With the changes below, the proxy (auth-broker) becomes also a "rest-broker", that can be thought of as a "Multi-tenant" data-api which provides an automated REST api for all the databases in the region. The core of the implementation (that leverages the subzero library) is in proxy/src/serverless/rest.rs and this is the only place that has "new logic". --------- Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2025-07-21 18:16:28 +00:00
Conrad Ludgate	b2ecb10f91	[proxy] rework handling of notices in sql-over-http (#12659 ) A replacement for #10254 which allows us to introduce notice messages for sql-over-http in the future if we want to. This also removes the `ParameterStatus` and `Notification` handling as there's nothing we could/should do for those.	2025-07-21 12:50:13 +00:00
Krzysztof Szafrański	96bcfba79e	[proxy] Cache GetEndpointAccessControl errors (#12571 ) Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-18 10:17:58 +00:00
Folke Behrens	64d0008389	proxy: Shorten the initial TTL of cancel keys (#12647 ) ## Problem A high rate of short-lived connections means that there are a lot of cancel keys in Redis with TTL=10min that could be avoided by having a much shorter initial TTL. ## Summary of changes * Introduce an initial TTL of 1min used with the SET command. * Fix: don't delay repushing cancel data when expired. * Prepare for exponentially increasing TTLs. ## Alternatives A best-effort UNLINK command on connection termination would clean up cancel keys right away. This needs a bigger refactor due to how batching is handled.	2025-07-17 21:52:20 +00:00
Krzysztof Szafrański	e2982ed3ec	[proxy] Cache node info only for TTL, even if Redis is available (#12626 ) This PR simplifies our node info cache. Now we'll store entries for at most the TTL duration, even if Redis notifications are available. This will allow us to cache intermittent errors later (e.g. due to rate limits) with more predictable behavior. Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-16 16:23:05 +00:00
Conrad Ludgate	c71aea0223	proxy: for json logging, only use callsite IDs if span name is duplicated (#12625 ) ## Problem We run multiple proxies, we get logs like ``` ... spans={"http_conn#22":{"conn_id": ... ... spans={"http_conn#24":{"conn_id": ... ``` these are the same span, and the difference is confusing. ## Summary of changes Introduce a counter per span name, rather than a global counter. If the counter is 0, no change to the span name is made. To follow up: see which span names are duplicated within the codebase in different callsites	2025-07-16 13:29:18 +00:00
Conrad Ludgate	87915df2fa	proxy: replace serde_json with our new json ser crate in the logging impl (#12602 ) This doesn't solve any particular problem, but it does simplify some of the code that was forced to round-trip through verbose Serialize impls.	2025-07-16 13:27:00 +00:00
Heikki Linnakangas	5c9c3b3317	Misc cosmetic cleanups (#12598 ) - Remove a few obsolete "allowed error messages" from tests. The pageserver doesn't emit those messages anymore. - Remove misplaced and outdated docstring comment from `test_tenants.py`. A docstring is supposed to be the first thing in a function, but we had added some code before it. And it was outdated, as we haven't supported running without safekeepers for a long time. - Fix misc typos in comments - Remove obsolete comment about backwards compatibility with safekeepers without `TIMELINE_STATUS` API. All safekeepers have it by now.	2025-07-15 14:36:28 +00:00
Krzysztof Szafrański	ff526a1051	[proxy] Recognize more cplane errors, use retry_delay_ms as TTL (#12543 ) ## Problem Not all cplane errors are properly recognized and cached/retried. ## Summary of changes Add more cplane error reasons. Also, use retry_delay_ms as cache TTL if present. Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-15 07:42:48 +00:00
Folke Behrens	296c9190b2	proxy: Use EXPIRE command to refresh cancel entries (#12580 ) ## Problem When refreshing cancellation data we resend the entire value again just to reset the TTL, which causes unnecessary load in proxy, on network and possibly on redis side. ## Summary of changes * Switch from using SET with full value to using EXPIRE to reset TTL. * Add a tiny delay between retries to prevent busy loop. * Shorten CancelKeyOp variants: drop redundant suffix. * Retry SET when EXPIRE failed.	2025-07-13 22:49:23 +00:00
Folke Behrens	a5fe67f361	proxy: cancel maintain_cancel_key task immediately (#12586 ) ## Problem When a connection terminates its maintain_cancel_key task keeps running until the CANCEL_KEY_REFRESH sleep finishes and then it triggers another cancel key TTL refresh before exiting. ## Summary of changes * Check for cancellation while sleeping and interrupt sleep. * If cancelled, break the loop, don't send a refresh cmd.	2025-07-13 17:27:39 +00:00
Conrad Ludgate	9bba31bf68	proxy: encode json as we parse rows (#11992 ) Serialize query row responses directly into JSON. Some of this code should be using the `json::value_as_object/list` macros, but I've avoided it for now to minimize the size of the diff.	2025-07-11 19:39:08 +00:00
Folke Behrens	380d167b7c	proxy: For cancellation data replace HSET+EXPIRE/HGET with SET..EX/GET (#12553 ) ## Problem To store cancellation data we send two commands to redis because the redis server version doesn't support HSET with EX. Also, HSET is not really needed. ## Summary of changes * Replace the HSET + EXPIRE command pair with one SET .. EX command. * Replace HGET with GET. * Leave a workaround for old keys set with HSET. * Replace some anyhow errors with specific errors to surface the WRONGTYPE error from redis.	2025-07-11 19:35:42 +00:00
Conrad Ludgate	f4245403b3	[proxy] allow testing query cancellation locally (#12568 ) ## Problem Cancellation requires redis, redis requires control-plane. ## Summary of changes Make redis for cancellation not require control plane. Add instructions for setting up redis locally.	2025-07-11 15:13:36 +00:00
Vlad Lazar	fe0ddb7169	libs: make remote storage failure injection probabilistic (#12526 ) Change the unreliable storage wrapper to fail by probability when there are more failure attempts left. Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>	2025-07-09 17:41:34 +00:00
Folke Behrens	5ea0bb2d4f	proxy: Drop unused metrics (#12521 ) * proxy_control_plane_token_acquire_seconds * proxy_allowed_ips_cache_misses * proxy_vpc_endpoint_id_cache_stats * proxy_access_blocker_flags_cache_stats * proxy_requests_auth_rate_limits_total * proxy_endpoints_auth_rate_limits * proxy_invalid_endpoints_total	2025-07-09 09:58:46 +00:00
Folke Behrens	e65d5f7369	proxy: Remove the endpoint filter cache (#12488 ) ## Problem The endpoint filter cache is still unused because it's not yet reliable enough to be used. It only consumes a lot of memory. ## Summary of changes Remove the code. Needs a new design. neondatabase/cloud#30634	2025-07-07 17:46:33 +00:00
Conrad Ludgate	03e604e432	Nightly lints and small tweaks (#12456 ) Let chains available in 1.88 :D new clippy lints coming up in future releases.	2025-07-03 14:47:12 +00:00
Ruslan Talpa	95e1011cd6	subzero pre-integration refactor (#12416 ) ## Problem integrating subzero requires a bit of refactoring. To make the integration PR a bit more manageable, the refactoring is done in this separate PR. ## Summary of changes * move common types/functions used in sql_over_http to errors.rs and http_util.rs * add the "Local" auth backend to proxy (similar to local_proxy), useful in local testing * change the Connect and Send type for the http client to allow for custom body when making post requests to local_proxy from the proxy --------- Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>	2025-07-03 11:04:08 +00:00
Conrad Ludgate	1bc1eae5e8	fix redis credentials check (#12455 ) ## Problem `keep_connection` does not exit, so it was never setting `credentials_refreshed`. ## Summary of changes Set `credentials_refreshed` to true when we first establish a connection, and after we re-authenticate the connection.	2025-07-03 09:51:35 +00:00
Folke Behrens	3415b90e88	proxy/logging: Add "ep" and "query_id" to list of extracted fields (#12437 ) Extract two more interesting fields from spans: ep (endpoint) and query_id. Useful for reliable filtering in logging.	2025-07-03 08:09:10 +00:00
Conrad Ludgate	e01c8f238c	[proxy] update noisy error logging (#12438 ) Health checks for pg-sni-router open a TCP connection and immediately close it again. This is noisy. We will filter out any EOF errors on the first message. "acquired permit" debug log is incorrect since it logs when we timed out as well. This fixes the debug log.	2025-07-03 07:46:48 +00:00
Conrad Ludgate	45607cbe0c	[local_proxy]: ignore TLS for endpoint (#12316 ) ## Problem When local proxy is configured with TLS, the certificate does not match the endpoint string. This currently returns an error. ## Summary of changes I don't think this code is necessary anymore, taking the prefix from the hostname is good enough (and is equivalent to what `endpoint_sni` was doing) and we ignore checking the domain suffix.	2025-07-03 07:35:57 +00:00
Conrad Ludgate	d6beb3ffbb	[proxy] rewrite pg-text to json routines (#12413 ) We would like to move towards an arena system for JSON encoding the responses. This change pushes an "out" parameter into the pg-text to json routines to make swapping in an arena system easier in the future. (see #11992) This additionally removes the redundant `column: &[Type]` argument, as well as rewriting the pg_array parser. --- I rewrote the pg_array parser since while making these changes I found it hard to reason about. I went back to the specification and rewrote it from scratch. There are 4 separate routines: 1. pg_array_parse - checks for any prelude (multidimensional array ranges) 2. pg_array_parse_inner - only deals with the arrays themselves 3. pg_array_parse_item - parses a single item from the array, this might be quoted, unquoted, or another nested array. 4. pg_array_parse_quoted - parses a quoted string, following the relevant string escaping rules.	2025-07-02 12:46:11 +00:00
Ivan Efremov	0f879a2e8f	[proxy]: Fix redis IRSA expiration failure errors (#12430 ) Relates to the [#30688](https://github.com/neondatabase/cloud/issues/30688)	2025-07-02 08:55:44 +00:00
Conrad Ludgate	4932963bac	[proxy]: dont log user errors from postgres (#12412 ) ## Problem #8843 User initiated sql queries are being classified as "postgres" errors, whereas they're really user errors. ## Summary of changes Classify user-initiated postgres errors as user errors if they are related to a sql query that we ran on their behalf. Do not log those errors.	2025-07-01 13:03:34 +00:00

696 Commits