rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 13:32:57 +00:00

Author	SHA1	Message	Date
George MacKerron	d8977d5199	Altered retry timing parameters for connect to compute, to get more and quicker retries (#5358 ) ## Problem Compute start time has improved, but the timing of connection retries from the proxy is rather slow, meaning we could be making clients wait hundreds of milliseconds longer than necessary. ## Summary of changes Previously, retry time in ms was `100 * 1.5**n`, and `n` starts at 1, giving: 150, 225, 337, 506, 759, 1139, 1709, ... This PR changes that to `25 * sqrt(2)**(n - 1)` instead, giving: 25, 35, 50, 71, 100, 141, 200, ...	2023-09-25 12:27:41 +01:00
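The two retry schedules in the commit above can be sketched as follows. This is a minimal illustration, not the proxy's code: the function names are hypothetical, and the old schedule is read as the geometric sequence 100 × 1.5^n, which is what the listed values (150, 225, 337, ...) imply.

```rust
/// Old schedule (hypothetical name): 100 * 1.5^n ms, with n starting at 1.
fn old_retry_delay_ms(n: u32) -> f64 {
    100.0 * 1.5_f64.powi(n as i32)
}

/// New schedule (hypothetical name): 25 * sqrt(2)^(n - 1) ms, with n starting at 1.
fn new_retry_delay_ms(n: u32) -> f64 {
    25.0 * 2.0_f64.sqrt().powi(n as i32 - 1)
}

fn main() {
    // Print both schedules side by side for the first seven attempts.
    for n in 1..=7 {
        println!(
            "attempt {n}: old {:.1} ms, new {:.1} ms",
            old_retry_delay_ms(n),
            new_retry_delay_ms(n)
        );
    }
}
```

With the new schedule the delay doubles every two attempts rather than growing by half on each attempt, and it starts from 25 ms instead of 150 ms.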
Joonas Koivunen	c5d226d9c7	refactor(consumption_metrics): prereq refactorings, tests (#5315 ) Split off from #5297. There should be no functional changes here: - refactor tenant metric "production" like previously timeline, allows unit testing, though not interesting enough yet to test - introduce type aliases for tuples - extra refactoring for `collect`, was initially thinking it was useful but will do a inline later - shorter binding names - support for future allocation reuse quests with IdempotencyKey - move code out of tokio::select to make it rustfmt-able - generification, allow later replacement of `&'static str` with enum - add tests that assert sent event contents exactly	2023-09-15 19:44:14 +03:00
Nikita Kalyanov	77658a155b	support deploying in IPv6-only environments (#4135 ) A set of changes to enable neon to work in IPv6 environments. The changes are backward-compatible but allow to deploy neon even to IPv6-only environments: - bind to both IPv4 and IPv6 interfaces - allow connections to Postgres from IPv6 interface - parse the address from control plane that could also be IPv6	2023-09-05 12:45:46 +03:00
Conrad Ludgate	44da9c38e0	proxy: error typo (#5187 ) ## Problem https://github.com/neondatabase/neon/pull/5162#discussion_r1311853491	2023-09-01 19:21:33 +03:00
Conrad Ludgate	1b916a105a	proxy: locked is not retriable (#5162 ) ## Problem Management service returns Locked when quotas are exhausted. We cannot retry on those ## Summary of changes Makes Locked status unretriable	2023-08-31 15:50:15 +03:00
Conrad Ludgate	d11621d904	Proxy: proxy protocol v2 (#5028 ) ## Problem We need to log the client IP, not the IP of the NLB. ## Summary of changes Parse the proxy [protocol version 2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) if possible	2023-08-31 14:30:25 +03:00
Nikita Kalyanov	b9c111962f	pass JWT to management API (#5151 ) support authentication with JWT from env for proxy calls to mgmt API	2023-08-31 12:23:51 +03:00
Conrad Ludgate	93dcdb293a	proxy: password hack hack (#5126 ) ## Problem fixes #4881 ## Summary of changes	2023-08-30 16:20:27 +01:00
Conrad Ludgate	3b81e0c86d	chore: remove webpki (#5069 ) ## Problem webpki is unmaintained Closes https://github.com/neondatabase/neon/security/dependabot/33 ## Summary of changes Update all dependents of webpki.	2023-08-30 15:14:03 +01:00
Conrad Ludgate	faf070f288	proxy: dont return connection pending (#5107 ) ## Problem We were returning Pending when a connection had a notice/notification (introduced recently in #5020). When returning pending, the runtime assumes you will call `cx.waker().wake()` in order to continue processing. We weren't doing that, so the connection task would get stuck ## Summary of changes Don't return pending. Loop instead	2023-08-25 15:08:45 +03:00
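The "loop instead of returning Pending" fix described above can be illustrated with a std-only sketch. `Conn`, `Message`, and `poll_once` are hypothetical names invented for this example, and treating an empty queue as ready is a simplification: a real connection would register the waker with its I/O source before returning `Poll::Pending`.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical message kinds a pooled connection might yield.
enum Message {
    Notice(String),
    Ready,
}

/// Hypothetical connection whose messages are already buffered.
struct Conn {
    queue: VecDeque<Message>,
    notices_seen: usize,
}

impl Conn {
    fn new(messages: Vec<Message>) -> Self {
        Conn { queue: VecDeque::from(messages), notices_seen: 0 }
    }
}

impl Future for Conn {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<usize> {
        // Keep looping after a notice. Returning `Poll::Pending` here would be
        // a bug: nothing has arranged for the waker to be called, so the task
        // would never be polled again.
        loop {
            match self.queue.pop_front() {
                Some(Message::Notice(text)) => {
                    println!("notice: {text}");
                    self.notices_seen += 1;
                }
                // Simplification: an empty queue is treated as completion.
                Some(Message::Ready) | None => return Poll::Ready(self.notices_seen),
            }
        }
    }
}

/// A waker that does nothing, used only to drive a single poll by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn poll_once(conn: &mut Conn) -> Poll<usize> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(conn).poll(&mut cx)
}

fn main() {
    let mut conn = Conn::new(vec![
        Message::Notice("first".into()),
        Message::Notice("second".into()),
        Message::Ready,
    ]);
    println!("{:?}", poll_once(&mut conn));
}
```

A single poll drains both notices and completes, which is the behavior the commit aims for.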
Conrad Ludgate	0b001a0001	proxy: remove connections on shutdown (#5051 ) ## Problem On shutdown, proxy connections are staying open. ## Summary of changes Remove the connections on shutdown	2023-08-21 19:20:58 +01:00
Conrad Ludgate	ec10838aa4	proxy: pool connection logs (#5020 ) ## Problem Errors and notices that happen during a pooled connection lifecycle have no session identifiers ## Summary of changes Using a watch channel, we set the session ID whenever it changes. This way we can see the status of a connection for that session Also, adding a connection id to be able to search the entire connection lifecycle	2023-08-18 11:44:08 +01:00
Conrad Ludgate	3e4710c59e	proxy: add more sasl logs (#5012 ) ## Problem A customer is having trouble connecting to neon from their production environment. The logs show a mix of "Internal error" and "authentication protocol violation" but not the full error ## Summary of changes Make sure we don't miss any logs during SASL/SCRAM	2023-08-17 12:05:54 +01:00
Conrad Ludgate	25934ec1ba	proxy: reduce global conn pool contention (#4747 ) ## Problem As documented, the global connection pool will be high contention. ## Summary of changes Use DashMap rather than Mutex<HashMap>. Of note, DashMap currently uses a RwLock internally, but it's partially sharded to reduce contention by a factor of N. We could potentially use flurry which is a port of Java's concurrent hashmap, but I have no good understanding of its performance characteristics. DashMap is at least equivalent to HashMap but with less contention. See the read heavy benchmark to analyse our expected performance <https://github.com/xacrimon/conc-map-bench#ready-heavy> I also spoke with the developer of dashmap recently, and they are working on porting the implementation to use concurrent HAMT FWIW	2023-08-16 17:20:28 +01:00
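The sharding idea mentioned above (several `RwLock`-protected maps instead of one global lock) can be sketched with the standard library alone. This is not DashMap's implementation; `ShardedMap` and its methods are hypothetical names, and the shard count and hashing scheme are assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;
use std::thread;

/// Hypothetical sharded map: each key hashes to one of N independently locked maps.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        ShardedMap { shards }
    }

    /// Pick the shard responsible for `key`.
    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    /// Insert under a write lock on one shard only; returns the previous value.
    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard_for(&key).write().unwrap().insert(key, value)
    }

    /// Read under a read lock on one shard only.
    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).read().unwrap().get(key).cloned()
    }
}

fn main() {
    let pool: ShardedMap<String, u32> = ShardedMap::new(8);
    // Writers touching different shards do not block each other.
    thread::scope(|s| {
        for i in 0..4u32 {
            let pool = &pool;
            s.spawn(move || {
                pool.insert(format!("endpoint-{i}"), i);
            });
        }
    });
    println!("{:?}", pool.get(&"endpoint-2".to_string()));
}
```

Two operations contend only when their keys land in the same shard, which is where the "factor of N" reduction in the commit message comes from.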
Arthur Petukhovsky	1b97a3074c	Disable neon-pool-opt-in (#4995 )	2023-08-15 20:57:56 +03:00
George MacKerron	218be9eb32	Added deferrable transaction option to http batch queries (#4993 ) ## Problem HTTP batch queries currently allow us to set the isolation level and read only, but not deferrable. ## Summary of changes Add support for deferrable. Echo deferrable status in response headers only if true. Likewise, now echo read-only status in response headers only if true.	2023-08-15 14:52:00 +01:00
George MacKerron	1ca08cc523	Changed batch query body to from [...] to { queries: [...] } (#4975 ) ## Problem It's nice if `single query : single response :: batch query : batch response`. But at present, in the single case we send `{ query: '', params: [] }` and get back a single `{ rows: [], ... }` object, while in the batch case we send an array of `{ query: '', params: [] }` objects and get back not an array of `{ rows: [], ... }` objects but a `{ results: [ { rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead. ## Summary of changes With this change, the batch query body becomes `{ queries: [{ query: '', params: [] }, ... ] }`, which restores a consistent relationship between the request and response bodies.	2023-08-14 16:07:33 +01:00
Arthur Petukhovsky	3a6b99f03c	proxy: improve http logs (#4976 ) Fix multiline logs on websocket errors and always print sql-over-http errors sent to the user.	2023-08-11 18:18:07 +03:00
Arthur Petukhovsky	73d7a9bc6e	proxy: propagate ws span (#4966 ) Found this log on staging: ``` 2023-08-10T17:42:58.573790Z INFO handling interactive connection from client protocol="ws" ``` We seem to be losing websocket span in spawn, this patch fixes it.	2023-08-10 23:38:22 +03:00
Conrad Ludgate	25c66dc635	proxy: http logging to 11 (#4950 ) ## Problem Mysterious network issues ## Summary of changes Log a lot more about HTTP/DNS in hopes of detecting more of the network errors	2023-08-10 17:49:24 +01:00
George MacKerron	538373019a	Increase max sql-over-http response size from 1MB to 10MB (#4961 ) ## Problem 1MB response limit is very small. ## Summary of changes This data is not yet tracked, so we shouldn't raise the limit too high yet. But as discussed with @kelvich and @conradludgate, this PR lifts it to 10MB, and also adds details of the limit to the error response.	2023-08-10 17:21:52 +01:00
Conrad Ludgate	0fa85aa08e	proxy: delay auth on retry (#4929 ) ## Problem When an endpoint is shutting down, it can take a few seconds. Currently when starting a new compute, this causes an "endpoint is in transition" error. We need to add delays before retrying to ensure that we allow time for the endpoint to shutdown properly. ## Summary of changes Adds a delay before retrying in auth. connect_to_compute already has this delay	2023-08-08 17:19:24 +03:00
Conrad Ludgate	606caa0c5d	proxy: update logs and span data to be consistent and have more info (#4878 ) ## Problem Pre-requisites for #4852 and #4853 ## Summary of changes 1. Includes the client's IP address (which we already log) with the span info so we can have it on all associated logs. This makes making dashboards based on IP addresses easier. 2. Switch to a consistent error/warning log for errors during connection. This includes error, num_retries, retriable=true/false and a consistent log message that we can grep for.	2023-08-04 12:37:18 +03:00
Joonas Koivunen	3a00a5deb2	refactor: tidy consumption metrics (#4860 ) Tidying up I've been wanting to do for some time. Follow-up to #4857.	2023-08-01 18:14:16 +03:00
Alex Chi Z	7b6c849456	support isolation level + read only for http batch sql (#4830 ) We will retrieve `neon-batch-isolation-level` and `neon-batch-read-only` from the http header, which sets the txn properties. https://github.com/neondatabase/serverless/pull/38#issuecomment-1653130981 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-08-01 02:59:11 +03:00
Conrad Ludgate	eb78603121	proxy: div by zero (#4845 ) ## Problem 1. In the CacheInvalid state loop, we weren't checking the `num_retries`. If this managed to get up to `32`, the retry_after procedure would compute 2^32 which would overflow to 0 and trigger a div by zero 2. When fixing the above, I started working on a flow diagram for the state machine logic and realised it was more complex than it had to be: a. We start in a `Cached` state b. `Cached`: call `connect_once`. After the first connect_once error, we always move to the `CacheInvalid` state, otherwise, we return the connection. c. `CacheInvalid`: we attempt to `wake_compute` and we either switch to Cached or we retry this step (or we error). d. `Cached`: call `connect_once`. We either retry this step or we have a connection (or we error) - After num_retries > 1 we never switch back to `CacheInvalid`. ## Summary of changes 1. Insert a `num_retries` check in the `handle_try_wake` procedure. Also using floats in the retry_after procedure to prevent the overflow entirely 2. Refactor connect_to_compute to be more linear in design.	2023-07-31 09:30:24 -04:00
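The overflow described in point 1 above can be reproduced in isolation. The sketch below does not reproduce the proxy's `retry_after` code; `int_factor`, `float_factor`, and `may_retry` are hypothetical helpers that only show why a 32-bit `2^32` becomes 0, why dividing by it fails, and how a float factor plus a retry cap avoids both problems.

```rust
/// Integer backoff factor 2^n in u32, wrapping on overflow (hypothetical helper).
fn int_factor(n: u32) -> u32 {
    2u32.wrapping_pow(n)
}

/// Float backoff factor 2^n, which has no wrap-around at n = 32 (hypothetical helper).
fn float_factor(n: u32) -> f64 {
    2f64.powi(n as i32)
}

/// Hypothetical guard: stop retrying once the cap is reached.
fn may_retry(num_retries: u32, max_retries: u32) -> bool {
    num_retries < max_retries
}

fn main() {
    // 2^31 still fits in a u32.
    println!("2^31 as u32 = {}", int_factor(31));
    // 2^32 does not fit and wraps to 0.
    println!("2^32 as u32 = {}", int_factor(32));
    // Dividing by that factor is a division by zero; `checked_div` reports it as None.
    println!("1000 / 2^32 = {:?}", 1000u32.checked_div(int_factor(32)));
    // The float version stays finite and non-zero.
    println!("2^32 as f64 = {}", float_factor(32));
    // A retry cap (16 is an arbitrary value for this sketch) keeps n small.
    println!("retry allowed at n = 32: {}", may_retry(32, 16));
}
```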
Alex Chi Z	a8f3540f3d	proxy: add unit test for wake_compute (#4819 ) ## Problem ref https://github.com/neondatabase/neon/pull/4721, ref https://github.com/neondatabase/neon/issues/4709 ## Summary of changes This PR adds unit tests for wake_compute. The patch adds a new variant `Test` to auth backends. When `wake_compute` is called, we will verify if it is the exact operation sequence we are expecting. The operation sequence now contains 3 more operations: `Wake`, `WakeRetry`, and `WakeFail`. The unit tests for proxy connects are now complete and I'll continue work on WebSocket e2e test in future PRs. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-28 19:10:55 -04:00
Conrad Ludgate	231d7a7616	proxy: retry compute wake in auth (#4817 ) ## Problem wake_compute can fail sometimes but is eligible for retries. We retry during the main connect, but not during auth. ## Summary of changes retry wake_compute during auth flow if there was an error talking to control plane, or if there was a temporary error in waking the compute node	2023-07-26 16:34:46 +01:00
Conrad Ludgate	35370f967f	proxy: add some connection init logs (#4812 ) ## Problem The first session event we emit is after we receive the first startup packet from the client. This means we can't detect any issues between TCP open and handling of the first PG packet ## Summary of changes Add some new logs for websocket upgrade and connection handling	2023-07-26 15:03:51 +00:00
Conrad Ludgate	2ebd2ce2b6	proxy: record connection type (#4802 ) ## Problem We want to measure how many users are using TCP/WS connections. We also want to measure how long it takes to establish a connection with the compute node. I plan to also add a separate counter for HTTP requests, but because of pooling this needs to be disambiguated against new HTTP compute connections ## Summary of changes * record connection type (ws/tcp) in the connection counters. * record connection latency including retry latency	2023-07-25 18:57:42 +03:00
Alex Chi Z	bcc2aee704	proxy: add tests for batch http sql (#4793 ) This PR adds an integration test case for batch HTTP SQL endpoint. https://github.com/neondatabase/neon/pull/4654/ should be merged first before we land this PR. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-25 15:08:24 +00:00
Nick Randall	062159ac17	support non-interactive transactions in sql-over-http (#4654 ) This PR adds support for non-interactive transaction query endpoint. It accepts an array of queries and parameters and returns an array of query results. The queries will be run in a single transaction one after another on the proxy side.	2023-07-25 13:03:55 +01:00
Alex Chi Z	407a20ceae	add proxy unit tests for retry connections (#4721 ) Given now we've refactored `connect_to_compute` as a generic, we can test it with mock backends. In this PR, we mock the error API and connect_once API to test the retry behavior of `connect_to_compute`. In the next PR, I'll add mock for credentials so that we can also test behavior with `wake_compute`. ref https://github.com/neondatabase/neon/issues/4709 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-24 20:41:42 +03:00
Conrad Ludgate	2e8a3afab1	proxy: merge handle_client (#4740 ) ## Problem Second half of #4699. we were maintaining 2 implementations of handle_client. ## Summary of changes Merge the handle_client code, but abstract some of the details. ## Checklist before requesting a review - [X] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-07-17 22:20:23 +01:00
Conrad Ludgate	e074ccf170	reduce proxy timeouts (#4708 ) ## Problem 10 retries * 10 second timeouts makes for a very long retry window. ## Summary of changes Adds a 2s timeout to sql_over_http connections, and also reduces the 10s timeout in TCP.	2023-07-17 20:05:26 +01:00
George MacKerron	196943c78f	CORS preflight OPTIONS support for /sql (http fetch) endpoint (#4706 ) ## Problem HTTP fetch can't be used from browsers because proxy doesn't support [CORS 'preflight' `OPTIONS` requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS#preflighted_requests). ## Summary of changes Added a simple `OPTIONS` endpoint for `/sql`.	2023-07-17 20:01:25 +01:00
Conrad Ludgate	7c85c7ea91	proxy: merge connect compute (#4713 ) ## Problem Half of #4699. TCP/WS have one implementation of `connect_to_compute`, HTTP has another implementation of `connect_to_compute`. Having both is annoying to deal with. ## Summary of changes Creates a set of traits `ConnectMechanism` and `ShouldError` that allows the `connect_to_compute` to be generic over raw TCP stream or tokio_postgres based connections. I'm not super happy with this. I think it would be nice to remove tokio_postgres entirely but that will need a lot more thought to be put into it. I have also slightly refactored the caching to use fewer references. Instead using ownership to ensure the state of retrying is encoded in the type system.	2023-07-17 15:53:01 +01:00
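A generic retry loop over a connection mechanism, in the spirit of the change above, might look like the following synchronous sketch. The trait names `ConnectMechanism` and `ShouldError` come from the commit message, but every method, associated type, and the mock `Flaky` type are assumptions made for illustration; the real traits are likely async and differ in detail.

```rust
/// Decides whether an error is worth retrying (method signature is hypothetical).
trait ShouldError {
    fn should_retry(&self, num_retries: u32) -> bool;
}

/// One way of establishing a connection (associated items are hypothetical).
trait ConnectMechanism {
    type Connection;
    type Error: ShouldError;

    fn connect_once(&mut self) -> Result<Self::Connection, Self::Error>;
}

/// Retry loop that is generic over the mechanism.
fn connect_with_retries<M: ConnectMechanism>(
    mechanism: &mut M,
) -> Result<M::Connection, M::Error> {
    let mut num_retries = 0;
    loop {
        match mechanism.connect_once() {
            Ok(conn) => return Ok(conn),
            Err(e) if e.should_retry(num_retries) => num_retries += 1,
            Err(e) => return Err(e),
        }
    }
}

/// Mock error: retriable or not, with a cap of 5 retries (arbitrary for this sketch).
#[derive(Debug, PartialEq)]
struct MockError {
    retriable: bool,
}

impl ShouldError for MockError {
    fn should_retry(&self, num_retries: u32) -> bool {
        self.retriable && num_retries < 5
    }
}

/// Mock mechanism that fails a fixed number of times before succeeding.
struct Flaky {
    failures_left: u32,
    retriable: bool,
    attempts: u32,
}

impl ConnectMechanism for Flaky {
    type Connection = &'static str;
    type Error = MockError;

    fn connect_once(&mut self) -> Result<Self::Connection, Self::Error> {
        self.attempts += 1;
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err(MockError { retriable: self.retriable })
        } else {
            Ok("connected")
        }
    }
}

fn main() {
    let mut flaky = Flaky { failures_left: 2, retriable: true, attempts: 0 };
    let result = connect_with_retries(&mut flaky);
    println!("{result:?} after {} attempts", flaky.attempts);
}
```

Because the loop depends only on the two traits, a mock such as `Flaky` can exercise the retry behavior without any network access.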
arpad-m	982fce1e72	Fix rustdoc warnings and test cargo doc in CI (#4711 ) ## Problem `cargo +nightly doc` is giving a lot of warnings: broken links, naked URLs, etc. ## Summary of changes * update the `proc-macro2` dependency so that it can compile on latest Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and https://github.com/dtolnay/proc-macro2/issues/398 * allow the `private_intra_doc_links` lint, as linking to something that's private is always more useful than just mentioning it without a link: if the link breaks in the future, at least there is a warning due to that. Also, one might enable [`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options) in the future and make these links work in general. * fix all the remaining warnings given by `cargo +nightly doc` * make it possible to run `cargo doc` on stable Rust by updating `opentelemetry` and associated crates to version 0.19, pulling in a fix that previously broke `cargo doc` on stable: https://github.com/open-telemetry/opentelemetry-rust/pull/904 * Add `cargo doc` to CI to ensure that it won't get broken in the future. Fixes #2557 ## Future work * Potentially, it might make sense, for development purposes, to publish the generated rustdocs somewhere, like for example [how the rust compiler does it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html). I will file an issue for discussion.	2023-07-15 05:11:25 +03:00
Alex Chi Z	1309571f5d	proxy: switch to structopt for clap parsing (#4714 ) Using `#[clap]` for parsing cli opts, which is easier to maintain. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-14 19:11:01 +03:00
Conrad Ludgate	db4d094afa	proxy: add more error cases to retry connect (#4707 ) ## Problem In the logs, I noticed we still weren't retrying in some cases. Seemed to be timeouts but we explicitly wanted to handle those ## Summary of changes Retry on io::ErrorKind::TimedOut errors. Handle IO errors in tokio_postgres::Error.	2023-07-13 11:47:27 +01:00
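Classifying I/O errors as retriable, as the commit above describes for `io::ErrorKind::TimedOut`, can be sketched like this. The function name `is_retriable` is hypothetical, and the kinds other than `TimedOut` are assumptions added for illustration rather than the proxy's actual list.

```rust
use std::io;

/// Hypothetical classifier: returns true for I/O errors worth retrying.
fn is_retriable(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        // Named in the commit message.
        io::ErrorKind::TimedOut
            // Assumed for this sketch; not confirmed by the commit message.
            | io::ErrorKind::ConnectionRefused
            | io::ErrorKind::ConnectionReset
    )
}

fn main() {
    let timeout = io::Error::new(io::ErrorKind::TimedOut, "connect timed out");
    let denied = io::Error::new(io::ErrorKind::PermissionDenied, "not allowed");
    println!("timeout retriable: {}", is_retriable(&timeout));
    println!("permission denied retriable: {}", is_retriable(&denied));
}
```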
Conrad Ludgate	0626e0bfd3	proxy: refactor some error handling and shutdowns (#4684 ) ## Problem It took me a while to understand the purpose of all the tasks spawned in the main functions. ## Summary of changes Utilising the type system and less macros, plus much more comments, document the shutdown procedure of each task in detail	2023-07-13 11:03:37 +01:00
Conrad Ludgate	a1d6b1a4af	proxy wake_compute loop (#4675 ) ## Problem If we fail to wake up the compute node, a subsequent connect attempt will definitely fail. However, kubernetes won't fail the connection immediately, instead it hangs until we timeout (10s). ## Summary of changes Refactor the loop to allow fast retries of compute_wake and to skip a connect attempt.	2023-07-12 11:38:36 +01:00
Conrad Ludgate	ac758e4f51	allow repeated IO errors from compute node (#4624 ) ## Problem #4598 compute nodes are not accessible some time after wake up due to kubernetes DNS not being fully propagated. ## Summary of changes Update connect retry mechanism to support handling IO errors and sleeping for 100ms ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-07-07 19:50:50 +03:00
Stas Kelvich	dbf88cf2d7	Minimalistic pool for http endpoint compute connections (under opt-in flag) Cache up to 20 connections per endpoint. Once all pooled connections are in use, the current implementation can open an extra connection, so the maximum number of simultaneous connections is not enforced. There are more things to do here, especially with background clean-up of closed connections, and checks for transaction state. But the current implementation allows us to check for the smaller connection latencies that this cache should bring.	2023-07-05 12:00:03 +03:00
Stas Kelvich	cbf9a40889	Set a shorter timeout for the initial connection attempt in proxy. In case we try to connect to an outdated address that is no longer valid, the default behavior of Kubernetes is to drop the packets, causing us to wait for the entire timeout period. We want to fail fast in such cases. A specific case to consider is when we have cached compute node information with a 5-minute TTL (Time To Live), but the user has executed a `/suspend` API call, resulting in the nonexistence of the compute node.	2023-07-04 20:34:22 +03:00
Conrad Ludgate	ab2ea8cfa5	use pbkdf2 crate (#4626 ) ## Problem While pbkdf2 is a simple algorithm, we should probably use a well tested implementation ## Summary of changes * Use pbkdf2 crate * Use arrays like the hmac comment says ## Checklist before requesting a review - [X] I have performed a self-review of my code. - [X] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-07-04 14:54:59 +01:00
Joonas Koivunen	cff7ae0b0d	fix: no more ansi colored logs (#4613 ) Allure does not support ansi colored logs, yet `compute_ctl` has them. Upgrade criterion to get rid of atty dependency, disable ansi colors, remove atty dependency and disable ansi feature of tracing-subscriber. This is a heavy-handed approach. I am not aware of a workflow where you'd want to connect a terminal directly to for example `compute_ctl`, usually you find the logs in a file. If someone had been using colors, they will now need to: - turn the `tracing-subscriber.default-features` to `true` - edit their wanted project to have colors I decided to explicitly disable ansi colors in case we would have in future a dependency accidentally enabling the feature on `tracing-subscriber`, which would be quite surprising but not unimaginable. By getting rid of `atty` from dependencies we get rid of <https://github.com/advisories/GHSA-g98v-hv3f-hcfr>.	2023-07-03 16:37:02 +03:00
Shany Pozin	26828560a8	Add timeouts and retries to consumption metrics reporting client (#4563 ) ## Problem #4528 ## Summary of changes Add a 60 seconds default timeout to the reqwest client Add retries for up to 3 times to call into the metric consumption endpoint --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-07-03 15:20:49 +03:00
Joonas Koivunen	4957bb2d48	fix(proxy): stray span enter globbers up logs (#4612 ) Prod logs have deep accidental span nesting. Introduced in #3759 and has been untouched since, maybe no one watches proxy logs? :) I found it by accident when looking to see if proxy logs have ansi colors with `{neon_service="proxy"}`. The solution is to mostly stop using `Span::enter` or `Span::entered` in async code. Kept on `Span::entered` in cancel on shutdown related path.	2023-07-03 11:53:57 +01:00
Arthur Petukhovsky	a7a0c3cd27	Invalidate proxy cache in http-over-sql (#4500 ) HTTP queries failed with errors `error connecting to server: failed to lookup address information: Name or service not known\n\nCaused by:\n failed to lookup address information: Name or service not known` The fix reused cache invalidation logic in proxy from usual postgres connections and added it to HTTP-over-SQL queries. Also removed a timeout for HTTP request, because it almost never worked on staging (50s+ time just to start the compute), and we can have a similar case in production. Should be ok, since we have limits for the requests and responses.	2023-06-14 19:24:46 +03:00

666 Commits