Commit Graph

688 Commits

Author SHA1 Message Date
Ruslan Talpa
6549708b44 change subzero dep sha 2025-07-04 13:39:49 +03:00
Ruslan Talpa
9f46ca5eb1 Merge branch 'main' into ruslan/subzero-integration 2025-07-04 13:03:55 +03:00
Ruslan Talpa
54da030a2d place the entire rest_broker code under a feature flag 2025-07-04 12:46:48 +03:00
Ruslan Talpa
afa4e48071 put subzero dependency under a feature flag 2025-07-04 11:27:36 +03:00
Conrad Ludgate
03e604e432 Nightly lints and small tweaks (#12456)
Let chains available in 1.88 :D new clippy lints coming up in future
releases.
2025-07-03 14:47:12 +00:00
Ruslan Talpa
b54872a4dc fix error after merging latest master 2025-07-03 17:24:13 +03:00
Ruslan Talpa
486829f875 Merge branch 'main' into ruslan/subzero-integration 2025-07-03 17:10:43 +03:00
Ruslan Talpa
95e1011cd6 subzero pre-integration refactor (#12416)
## Problem
integrating subzero requires a bit of refactoring. To make the
integration PR a bit more manageable, the refactoring is done in this
separate PR.
 
## Summary of changes
* move common types/functions used in sql_over_http to errors.rs and
http_util.rs
* add the "Local" auth backend to proxy (similar to local_proxy), useful
in local testing
* change the Connect and Send type for the http client to allow for
custom body when making post requests to local_proxy from the proxy

---------

Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
2025-07-03 11:04:08 +00:00
Conrad Ludgate
1bc1eae5e8 fix redis credentials check (#12455)
## Problem

`keep_connection` does not exit, so it was never setting
`credentials_refreshed`.

## Summary of changes

Set `credentials_refreshed` to true when we first establish a
connection, and after we re-authenticate the connection.
2025-07-03 09:51:35 +00:00
Folke Behrens
3415b90e88 proxy/logging: Add "ep" and "query_id" to list of extracted fields (#12437)
Extract two more interesting fields from spans: ep (endpoint) and
query_id.
Useful for reliable filtering in logging.
2025-07-03 08:09:10 +00:00
Conrad Ludgate
e01c8f238c [proxy] update noisy error logging (#12438)
Health checks for pg-sni-router open a TCP connection and immediately
close it again. This is noisy. We will filter out any EOF errors on the
first message.

"acquired permit" debug log is incorrect since it logs when we timedout
as well. This fixes the debug log.
2025-07-03 07:46:48 +00:00
Conrad Ludgate
45607cbe0c [local_proxy]: ignore TLS for endpoint (#12316)
## Problem

When local proxy is configured with TLS, the certificate does not match
the endpoint string. This currently returns an error.

## Summary of changes

I don't think this code is necessary anymore, taking the prefix from the
hostname is good enough (and is equivalent to what `endpoint_sni` was
doing) and we ignore checking the domain suffix.
2025-07-03 07:35:57 +00:00
Conrad Ludgate
d6beb3ffbb [proxy] rewrite pg-text to json routines (#12413)
We would like to move towards an arena system for JSON encoding the
responses. This change pushes an "out" parameter into the pg-test to
json routines to make swapping in an arena system easier in the future.
(see #11992)

This additionally removes the redundant `column: &[Type]` argument, as
well as rewriting the pg_array parser.

---

I rewrote the pg_array parser since while making these changes I found
it hard to reason about. I went back to the specification and rewrote it
from scratch. There's 4 separate routines:
1. pg_array_parse - checks for any prelude (multidimensional array
ranges)
2. pg_array_parse_inner - only deals with the arrays themselves
3. pg_array_parse_item - parses a single item from the array, this might
be quoted, unquoted, or another nested array.
4. pg_array_parse_quoted - parses a quoted string, following the
relevant string escaping rules.
2025-07-02 12:46:11 +00:00
Ivan Efremov
0f879a2e8f [proxy]: Fix redis IRSA expiration failure errors (#12430)
Relates to the
[#30688](https://github.com/neondatabase/cloud/issues/30688)
2025-07-02 08:55:44 +00:00
Conrad Ludgate
4932963bac [proxy]: dont log user errors from postgres (#12412)
## Problem

#8843 

User initiated sql queries are being classified as "postgres" errors,
whereas they're really user errors.

## Summary of changes

Classify user-initiated postgres errors as user errors if they are
related to a sql query that we ran on their behalf. Do not log those
errors.
2025-07-01 13:03:34 +00:00
Ruslan Talpa
4775aa3e01 Merge branch 'main' into ruslan/subzero-integration 2025-07-01 13:46:57 +03:00
Ruslan Talpa
1785f856b6 Move the local auth backend under the "testing" feature 2025-06-30 16:52:45 +03:00
Ruslan Talpa
69b22b05da add in readme the way to run auth/rest broker locally 2025-06-30 16:17:35 +03:00
Ruslan Talpa
bf0007fa96 add note about local confir read code 2025-06-30 16:12:04 +03:00
Ruslan Talpa
a9bbe7b00b remove unused imports 2025-06-30 16:02:30 +03:00
Ruslan Talpa
7e3f64b309 implement local auth backend for proxy and remove control plane hacks 2025-06-30 16:00:43 +03:00
Ruslan Talpa
9480d17de7 fix bug in pickcurrent_chema 2025-06-30 12:53:47 +03:00
Ruslan Talpa
424004ec95 apply cargo fmt 2025-06-30 12:32:47 +03:00
Ruslan Talpa
88d1a78260 cleanup the rest path code 2025-06-30 12:30:33 +03:00
Folke Behrens
0ee15002fc proxy: Move client connection accept and handshake to pglb (#12380)
* This must be a no-op.
* Move proxy::task_main to pglb::task_main.
* Move client accept, TLS and handshake to pglb.
* Keep auth and wake in proxy.
2025-06-27 15:20:23 +00:00
Ruslan Talpa
8e544c7f99 import introspection queries instead of loading from files 2025-06-27 14:40:02 +03:00
Ruslan Talpa
4f49fc5b79 move common error types and http realted functions to error.rs and http_util.rs 2025-06-27 13:37:29 +03:00
Conrad Ludgate
abc1efd5a6 [proxy] fix connect_to_compute retry handling (#12351)
# Problem

In #12335 I moved the `authenticate` method outside of the
`connect_to_compute` loop. This triggered [e2e tests to become
flaky](https://github.com/neondatabase/cloud/pull/30533). This
highlighted an edge case we forgot to consider with that change.

When we connect to compute, the compute IP might be cached. This cache
hit might however be stale. Because we can't validate the IP is
associated with a specific compute-id☨, we will succeed the
connect_to_compute operation and fail when it comes to password
authentication☨☨. Before the change, we were invalidating the cache and
triggering wake_compute if the authentication failed.

Additionally, I noticed some faulty logic I introduced 1 year ago
https://github.com/neondatabase/neon/pull/8141/files#diff-5491e3afe62d8c5c77178149c665603b29d88d3ec2e47fc1b3bb119a0a970afaL145-R147

☨ We can when we roll out TLS, as the certificate common name includes
the compute-id.

☨☨ Technically password authentication could pass for the wrong compute,
but I think this would only happen in the very very rare event that the
IP got reused **and** the compute's endpoint happened to be a
branch/replica.

# Solution

1. Fix the broken logic
2. Simplify cache invalidation (I don't know why it was so convoluted)
3. Add a loop around connect_to_compute + authenticate to re-introduce
the wake_compute invalidation we accidentally removed.

I went with this approach to try and avoid interfering with
https://github.com/neondatabase/neon/compare/main...cloneable/proxy-pglb-connect-compute-split.
The changes made in commit 3 will move into `handle_client_request` I
suspect,
2025-06-27 10:36:27 +00:00
Ruslan Talpa
5461039c3f implement remote config fetch from the db and cache introspected schema 2025-06-27 10:20:26 +03:00
Ruslan Talpa
d6c36d103e subzero integration WIP6
beginning work on introspection of config and schema shape from the database
2025-06-26 14:42:31 +03:00
Conrad Ludgate
fd1e8ec257 [proxy] review and cleanup CLI args (#12167)
I was looking at how we could expose our proxy config as toml again, and
as I was writing out the schema format, I noticed some cruft in our CLI
args that no longer seem to be in use.

The redis change is the most complex, but I am pretty sure it's sound.
Since https://github.com/neondatabase/cloud/pull/15613 cplane longer
publishes to the global redis instance.
2025-06-26 11:25:41 +00:00
Ruslan Talpa
8072fae2fe Merge branch 'main' into ruslan/subzero-integration 2025-06-26 10:19:16 +03:00
Ruslan Talpa
3869d680f9 use a global parsed/cached schema 2025-06-26 10:14:28 +03:00
Conrad Ludgate
517a3d0d86 [proxy]: BatchQueue::call is not cancel safe - make it directly cancellation aware (#12345)
## Problem

https://github.com/neondatabase/cloud/issues/30539

If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.

## Summary of changes

Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.

## Alternatives considered

1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
2025-06-25 14:19:20 +00:00
Ruslan Talpa
d3fa228d92 move subzero local test files to a "dot" folder 2025-06-25 16:43:50 +03:00
Conrad Ludgate
27ca1e21be [console_redirect_proxy]: fix channel binding (#12238)
## Problem

While working more on TLS to compute, I realised that Console Redirect
-> pg-sni-router -> compute would break if channel binding was set to
prefer. This is because the channel binding data would differ between
Console Redirect -> pg-sni-router vs pg-sni-router -> compute.

I also noticed that I actually disabled channel binding in #12145, since
`connect_raw` would think that the connection didn't support TLS.

## Summary of changes

Make sure we specify the channel binding.
Make sure that `connect_raw` can see if we have TLS support.
2025-06-25 13:41:30 +00:00
Ruslan Talpa
be6a259b85 subzero integration WIP5
cleanup and postprocess the response and set the correct headers/status and handle errors
2025-06-25 14:33:45 +03:00
Ruslan Talpa
af3ca24a5e remove unused enum values 2025-06-25 14:32:46 +03:00
Dmitry Savelev
158d84ea30 Switch the billing metrics storage format to ndjson. (#12338)
## Problem

The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.

## Summary of changes

Change the events storage format for billing events from JSON to NDJSON.

Resolves: https://github.com/neondatabase/cloud/issues/29994
2025-06-24 15:36:36 +00:00
Conrad Ludgate
4dd9ca7b04 [proxy]: authenticate to compute after connect_to_compute (#12335)
## Problem

PGLB will do the connect_to_compute logic, neonkeeper will do the
session establishment logic. We should split it.

## Summary of changes

Moves postgres authentication to compute to a separate routine that
happens after connect_to_compute.
2025-06-24 14:15:36 +00:00
Ruslan Talpa
8b44f5b479 subzero integration WIP5
extract the response body from the local proxy response
2025-06-24 17:03:42 +03:00
Ruslan Talpa
d1445cf3eb subzero integration WIP4
queries generated by subzero reach database and execute succesfully
2025-06-24 15:33:51 +03:00
Arpad Müller
552249607d apply clippy fixes for 1.88.0 beta (#12331)
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.

This is therefore a preparation PR (like similar PRs before it).

There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).

The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2025-06-24 10:12:42 +00:00
Ruslan Talpa
67d3026fc4 subzero integration WIP3
* query makes it to the database
2025-06-23 11:48:55 +03:00
Ruslan Talpa
09e62e9b98 subzero integration WIP2 2025-06-23 10:11:06 +03:00
Ruslan Talpa
e121da4bfc subzero integration WIP1 2025-06-20 15:10:45 +03:00
Conrad Ludgate
a298d2c29b [proxy] replace the batch cancellation queue, shorten the TTL for cancel keys (#11943)
See #11942 

Idea: 
* if connections are short lived, they can get enqueued and then also
remove themselves later if they never made it to redis. This reduces the
load on the queue.
* short lived connections (<10m, most?) will only issue 1 command, we
remove the delete command and rely on ttl.
* we can enqueue as many commands as we want, as we can always cancel
the enqueue, thanks to the ~~intrusive linked lists~~ `BTreeMap`.
2025-06-20 11:48:01 +00:00
Ruslan Talpa
b39f04ab99 add missing parts to make disable_pg_session_jwt flag work 2025-06-20 10:23:42 +03:00
Ruslan Talpa
6bd15908fb make pg_session_jwt instalation optional with a cli flag 2025-06-20 10:17:32 +03:00
Ruslan Talpa
3e36d516c2 vanilla pg dokcer image setup 2025-06-20 09:37:39 +03:00