rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-04 20:12:54 +00:00

Author	SHA1	Message	Date
Luca Bruno	8da3b547f8	proxy/http: switch to typed_json (#8377 ) ## Summary of changes This switches JSON rendering logic to `typed_json` in order to reduce the number of allocations in the HTTP responder path. Followup from https://github.com/neondatabase/neon/pull/8319#issuecomment-2216991760. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2024-07-15 12:38:52 +01:00
Conrad Ludgate	411a130675	Fix nightly warnings 2024 june (#8151) ## Problem New clippy warnings on nightly. ## Summary of changes Each commit is broken up by warning type. 1. Remove some unnecessary refs. 2. In edition 2024, inference will default to `!` and not `()`. 3. Clippy complains about doc comment indentation. 4. Fix `Trait + ?Sized` where `Trait: Sized`. 5. diesel_derives triggering `non_local_definitions`.	2024-07-12 13:58:04 +01:00
Conrad Ludgate	1afab13ccb	proxy: remove some trace logs (#8334 )	2024-07-10 15:05:25 +01:00
Conrad Ludgate	fe13fccdc2	proxy: pg17 fixes (#8321 ) ## Problem #7809 - we do not support sslnegotiation=direct #7810 - we do not support negotiating down the protocol extensions. ## Summary of changes 1. Same as postgres, check the first startup packet byte for tls header `0x16`, and check the ALPN. 2. Tell clients using protocol >3.0 to downgrade	2024-07-10 09:10:29 +01:00
Christian Schwarz	3f7aebb01c	refactor: postgres_backend: replace abstract shutdown_watcher with CancellationToken (#8295 ) Preliminary refactoring while working on https://github.com/neondatabase/neon/issues/7427 and specifically https://github.com/neondatabase/neon/pull/8286	2024-07-09 21:11:11 +03:00
Conrad Ludgate	4a5b55c834	chore: fix nightly build (#8142 ) ## Problem `cargo +nightly check` fails ## Summary of changes Updates `measured`, `time`, and `crc32c`. * `measured`: updated to fix https://github.com/rust-lang/rust/issues/125763. * `time`: updated to fix https://github.com/rust-lang/rust/issues/125319 * `crc32c`: updated to remove some nightly feature detection with a removed nightly feature	2024-07-09 18:25:49 +01:00
Luca BRUNO	c196cf6ac1	proxy/http: avoid spurious vector reallocations This tweaks the rows-to-JSON rendering logic in order to avoid allocating 0-sized temporary vectors and later growing them to insert elements. As the exact size is known in advance, both vectors can be built with an exact capacity upfront. This will avoid further vector growing/reallocation in the rendering hotpath. Signed-off-by: Luca BRUNO <lucab@lucabruno.net>	2024-07-09 15:20:00 +01:00
Conrad Ludgate	e03c3c9893	proxy: cache certain non-retriable console errors for a short time (#8201 ) ## Problem If there's a quota error, it makes sense to cache it for a short window of time. Many clients do not handle database connection errors gracefully, so just spam retry 🤡 ## Summary of changes Updates the node_info cache to support storing console errors. Store console errors if they cannot be retried (using our own heuristic. should only trigger for quota exceeded errors).	2024-07-04 09:03:03 +01:00
Christian Schwarz	7dcdbaa25e	remote_storage config: move handling of empty inline table `{}` to callers (#8193) Before this PR, `RemoteStorageConfig::from_toml` would support deserializing an empty `{}` TOML inline table to a `None`, otherwise try `Some()`. We can instead let * in proxy: let clap derive handle the Option * in PS & SK: assume that if the field is specified, it must be a valid RemoteStorageConfig (This PR started with a much simpler goal of factoring out the `deserialize_item` function because I need that in another PR).	2024-07-02 12:53:08 +02:00
Conrad Ludgate	d7e349d33c	proxy: report blame for passthrough disconnect io errors (#8170 ) ## Problem Hard to debug the disconnection reason currently. ## Summary of changes Keep track of error-direction, and therefore error source (client vs compute) during passthrough.	2024-06-26 15:11:26 +00:00
Conrad Ludgate	6c5d3b5263	proxy fix wake compute console retry (#8141 ) ## Problem 1. Proxy is retrying errors from cplane that shouldn't be retried 2. ~~Proxy is not using the retry_after_ms value~~ ## Summary of changes 1. Correct the could_retry impl for ConsoleError. 2. ~~Update could_retry interface to support returning a fixed wait duration.~~	2024-06-25 18:07:54 +00:00
Conrad Ludgate	78d9059fc7	proxy: update tokio-postgres to allow arbitrary config params (#8076 ) ## Problem Fixes https://github.com/neondatabase/neon/issues/1287 ## Summary of changes tokio-postgres now supports arbitrary server params through the `param(key, value)` method. Some keys are special so we explicitly filter them out.	2024-06-24 10:20:27 +00:00
Arpad Müller	75747cdbff	Use serde for RemoteStorageConfig parsing (#8126 ) Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the same as #7743 but with less repetitive code, by deriving `Deserialize` impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The disadvantage is less useful error messages. The git history of this PR contains a state where we go via an intermediate representation, leveraging the `serde_json` crate, without it ever being actual json though. Also, the PR adds deserialization tests. Alternative to #7743 .	2024-06-22 17:57:09 +00:00
Conrad Ludgate	b998b70315	proxy: reduce some per-task memory usage (#8095 ) ## Problem Some tasks are using around upwards of 10KB of memory at all times, sometimes having buffers that swing them up to 30MB. ## Summary of changes Split some of the async tasks in selective places and box them as appropriate to try and reduce the constant memory usage. Especially in the locations where the large future is only a small part of the total runtime of the task. Also, reduces the size of the CopyBuffer buffer size from 8KB to 1KB. In my local testing and in staging this had a minor improvement. sadly not the improvement I was hoping for :/ Might have more impact in production	2024-06-19 13:34:15 +01:00
Conrad Ludgate	e6eb0020a1	update rust to 1.79.0 (#8048) ## Problem Rust 1.79 enables new lints by default. ## Summary of changes * update to rust 1.79 * `s/default_features/default-features/` * fix proxy dead code. * fix pageserver dead code.	2024-06-14 13:23:52 +02:00
Anna Khanova	fbccd1e676	Proxy process updated errors (#8026 ) ## Problem Respect errors classification from cplane	2024-06-13 14:42:26 +02:00
Yuchen Liang	630cfbe420	refactor(pageserver): designated api error type for cancelled request (#7949 ) Closes #7406. ## Problem When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is exposed to handle that case, which verbosely logs the error. However, we don't benefit from having the full backtrace provided by anyhow in this case. ## Summary of changes This PR introduces a new `ApiError` type to handle errors caused by cancelled request more robustly. - A new enum variant `ApiError::Cancelled` - Currently the cancelled request is mapped to status code 500. - Need to handle this error in proxy's `http_util` as well. - Added a failpoint test to simulate cancelled `get_lsn_by_timestamp` request. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-06-06 14:00:14 +00:00
Anna Khanova	00032c9d9f	[proxy] Fix dynamic rate limiter (#7950 ) ## Problem There was a bug in dynamic rate limiter, which exhausted CPU in proxy and proxy wasn't able to accept any connections. ## Summary of changes 1. `if self.available > 1` -> `if self.available >= 1` 2. remove `timeout_at` to use just timeout 3. remove potential infinite loops which can exhaust CPUs.	2024-06-04 05:07:54 +01:00
Conrad Ludgate	9a081c230f	proxy: lazily parse startup pg params (#7905 ) ## Problem proxy params being a `HashMap<String,String>` when it contains just ``` application_name: psql database: neondb user: neondb_owner ``` is quite wasteful allocation wise. ## Summary of changes Keep the params in the wire protocol form, eg: ``` application_name\0psql\0database\0neondb\0user\0neondb_owner\0 ``` Using a linear search for the map is fast enough at small sizes, which is the normal case.	2024-05-30 11:02:38 +00:00
Conrad Ludgate	fddd11dd1a	proxy: upload postgres connection options as json in the parquet upload (#7903 ) ## Problem https://github.com/neondatabase/cloud/issues/9943 ## Summary of changes Captures the postgres options, converts them to json, uploads them in parquet.	2024-05-30 11:10:27 +01:00
Conrad Ludgate	238fa47bee	proxy fix wake compute rate limit (#7902 ) ## Problem We were rate limiting wake_compute in the wrong place ## Summary of changes Move wake_compute rate limit to after the permit is acquired. Also makes a slight refactor on normalize, as it caught my eye	2024-05-30 11:09:27 +01:00
Conrad Ludgate	c8cebecabf	proxy: reintroduce dynamic limiter for compute lock (#7737 ) ## Problem Computes that are healthy can manage many connection attempts at a time. Unhealthy computes cannot. We initially handled this with a fixed concurrency limit, but it seems this inhibits pgbench. ## Summary of changes Support AIMD for connect_to_compute lock to allow varying the concurrency limit based on compute health	2024-05-29 11:17:05 +01:00
Arpad Müller	14df69d0e3	Drop postgres-native-tls in favour of tokio-postgres-rustls (#7883 ) Get rid of postgres-native-tls and openssl in favour of rustls in our dependency tree. Do further steps to completely remove native-tls and openssl. Among other advantages, this allows us to do static musl builds more easily: #7889	2024-05-28 15:40:52 +00:00
Conrad Ludgate	43f9a16e46	proxy: fix websocket buffering (#7878 ) ## Problem Seems the websocket buffering was broken for large query responses only ## Summary of changes Move buffering until after the underlying stream is ready. Tested locally confirms this fixes the bug. Also fixes the pg-sni-router missing metrics bug	2024-05-24 17:56:12 +01:00
Anna Khanova	cd6d811213	[proxy] Do not fail after parquet upload error (#7858 ) ## Problem If the parquet upload was unsuccessful, it will panic. ## Summary of changes Write error in logs instead.	2024-05-23 09:41:29 +00:00
Conrad Ludgate	9cfe08e3d9	proxy password threadpool (#7806) ## Problem Despite making password hashing async, it can still take time away from the network code. ## Summary of changes Introduce a custom threadpool, inspired by rayon. Features: ### Fairness Each task is tagged with its endpoint ID. The more times we have seen the endpoint, the more likely we are to skip the task if it comes up in the queue. This uses a count-min sketch estimator for the number of times we have seen the endpoint, resetting it every 1000+ steps. Since tasks are immediately rescheduled if they do not complete, the worker could get stuck in an "always work available" loop. To combat this, we check the global queue every 61 steps to ensure all tasks quickly get a worker assigned to them. ### Balanced Using crossbeam_deque, like rayon does, we have workstealing out of the box. I've tested it a fair amount and it seems to balance the workload accordingly.	2024-05-22 17:05:43 +00:00
Conrad Ludgate	a5ecca976e	proxy: bump parquet (#7782 ) ## Summary of changes Updates the parquet lib. one change left that we need is in an open PR against upstream, hopefully we can remove the git dependency by 52.0.0 https://github.com/apache/arrow-rs/pull/5773 I'm not sure why the parquet files got a little bit bigger. I tested them and they still open fine. 🤷 side effect of the update, chrono updated and added yet another deprecation warning (hence why the safekeepers change)	2024-05-19 19:45:53 +00:00
Conrad Ludgate	790c05d675	proxy: swap tungstenite for a simpler impl (#7353 ) ## Problem I wanted to do a deep dive of the tungstenite codebase. tokio-tungstenite is incredibly convoluted... In my searching I found [fastwebsockets by deno](https://github.com/denoland/fastwebsockets), but it wasn't quite sufficient. This also removes the default 16MB/64MB frame/message size limitation. framed-websockets solves this by inserting continuation frames for partially received messages, so the whole message does not need to be entirely read into memory. ## Summary of changes I took the fastwebsockets code as a starting off point and rewrote it to be simpler, server-only, and be poll-based to support our Read/Write wrappers. I have replaced our tungstenite code with my framed-websockets fork. <https://github.com/neondatabase/framed-websockets>	2024-05-16 13:05:50 +02:00
Anna Khanova	be1a88e574	Proxy added per ep rate limiter (#7636 ) ## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).	2024-05-10 12:17:00 +02:00
Conrad Ludgate	e3a2631df9	proxy: do not invalidate cache for permit errors (#7652) ## Problem If a permit cannot be acquired to connect to compute, the cache is invalidated. This had the observed effect of sending more traffic to ProxyWakeCompute on cplane. ## Summary of changes Make sure that permit acquire failures are marked as "should not invalidate cache".	2024-05-08 10:33:41 +00:00
Conrad Ludgate	0c99e5ec6d	proxy: cull http connections (#7632 ) ## Problem Some HTTP client connections can stay open for quite a long time. ## Summary of changes When there are too many HTTP client connections, pick a random connection and gracefully cancel it.	2024-05-07 18:15:06 +01:00
Anna Khanova	f1b654b77d	proxy: reduce number of concurrent connections (#7620) ## Problem Usually, the connection itself is quite fast (below 10ms for p999: https://neonprod.grafana.net/goto/aOyn8vYIg?orgId=1). It doesn't make sense to wait a long time for the lock: if it takes a long time to acquire, something has probably gone wrong. We also spawn a lot of retries, but they are not very helpful (0 means that it connected successfully; 1 most probably means that it was a re-request of the compute node address https://neonprod.grafana.net/goto/J_8VQvLIR?orgId=1). Let's try to keep the number of retries small.	2024-05-06 19:03:25 +00:00
Tristan Partin	69337be5c2	Fix grammar in provider.rs error message s/temporary/temporarily --------- Co-authored-by: Barry Grenon <barry_grenon@yahoo.ca>	2024-05-06 09:14:42 -05:00
Conrad Ludgate	9b65946566	proxy: add connect compute concurrency lock (#7607 ) ## Problem Too many connect_compute attempts can overwhelm postgres, getting the connections stuck. ## Summary of changes Limit number of connection attempts that can happen at a given time.	2024-05-03 15:45:24 +00:00
Anna Khanova	240efb82f9	Proxy reconnect pubsub before expiration (#7562 ) ## Problem Proxy reconnects to redis only after it's already unavailable. ## Summary of changes Reconnects every 6h.	2024-05-03 10:00:29 +02:00
Anna Khanova	25af32e834	proxy: keep track on the number of events from redis by type. (#7582 ) ## Problem It's unclear what is the distribution of messages, proxy is consuming from redis. ## Summary of changes Add counter.	2024-05-02 09:50:11 +00:00
Conrad Ludgate	cb4b4750ba	update to reqwest 0.12 (#7561 ) ## Problem #7557 ## Summary of changes	2024-05-02 11:16:04 +02:00
Anna Khanova	1684bbf162	proxy: Create disconnect events (#7535 ) ## Problem It's not possible to get the duration of the session from proxy events. ## Summary of changes * Added a separate events folder in s3, to record disconnect events. * Disconnect events are exactly the same as normal events, but also have `disconnect_timestamp` field not empty. * @oruen suggested to fill it with the same information as the original events to avoid potentially heavy joins.	2024-04-29 15:22:13 +02:00
Anna Khanova	90cadfa986	proxy: Adjust retry wake compute (#7537 ) ## Problem Right now we always do retry wake compute. ## Summary of changes Create a list of errors when we could avoid needless retries.	2024-04-29 12:26:21 +00:00
Anna Khanova	24ce878039	proxy: Exclude compute and retries (#7529) ## Problem Alerts fire if the connection to the compute is slow. ## Summary of changes Exclude compute and retry from latencies.	2024-04-29 11:49:42 +02:00
Anna Khanova	5357f40183	proxy: Workaround switch to the regional redis (#7513 ) ## Problem Start switching from the global redis to the regional one ## Summary of changes * Publish cancellations to the regional redis * Listen notifications from both: global and regional	2024-04-25 15:26:18 +00:00
Anna Khanova	b1d47f3911	proxy: Fix cancellations (#7510 ) ## Problem Cancellations were published to the channel, that was never read. ## Summary of changes Fallback to global redis publishing.	2024-04-25 11:38:51 +00:00
Anna Khanova	a3d62b31bb	Update connect to compute and wake compute retry configs (#7509 ) ## Problem ## Summary of changes Decrease waiting time	2024-04-25 11:16:27 +00:00
Conrad Ludgate	cdccab4bd9	reduce complexity of proxy protocol parse (#7078 ) ## Problem The `WithClientIp` AsyncRead/Write abstraction never filled me with much joy. I would just rather read the protocol header once and then get the remaining buf and reader. ## Summary of changes * Replace `WithClientIp::wait_for_addr` with `read_proxy_protocol`. * Replace `WithClientIp` with `ChainRW`. * Optimise `ChainRW` to make the standard path more optimal.	2024-04-25 11:14:04 +01:00
Arpad Müller	c18d3340b5	Ability to specify the upload_storage_class in S3 bucket configuration (#7461 ) Currently we move data to the intended storage class via lifecycle rules, but those are a daily batch job so data first spends up to a day in standard storage. Therefore, make it possible to specify the storage class used for uploads to S3 so that the data doesn't have to be migrated automatically. The advantage of this is that it gives cleaner billing reports. Part of https://github.com/neondatabase/cloud/issues/11348	2024-04-24 18:48:25 +02:00
Anna Khanova	5dda371c2b	Fix a bug with retries (#7494 ) ## Problem ## Summary of changes By default, it's 5s retry.	2024-04-24 14:13:18 +01:00
Anna Khanova	6a5650d40c	proxy: Make retries configurable and record it. (#7438) ## Problem Currently we cannot configure retries, and we don't really have visibility into what's going on there. ## Summary of changes * Added cli params * Improved logging * Decreased the number of retries: it feels like most retries don't help. Once error handling is better, we can increase it again.	2024-04-22 11:37:22 +00:00
Anna Khanova	5191f6ef0e	proxy: Record only valid rejected events (#7415) ## Problem Sometimes the rejected metric might record invalid events. ## Summary of changes * Only record it if `rejected` was explicitly set. * Change order in logs. * Report metrics if not under high-load.	2024-04-18 06:09:12 +01:00
Conrad Ludgate	a54ea8fb1c	proxy: move endpoint rate limiter (#7413 ) ## Problem ## Summary of changes Rate limit for wake_compute calls	2024-04-18 06:00:33 +01:00
Anna Khanova	d5708e7435	proxy: Record role to span (#7407 ) ## Problem ## Summary of changes Add dbrole to span.	2024-04-17 14:16:11 +02:00
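
The wire-form lookup described in the 9a081c230f entry (#7905) above can be illustrated with a minimal sketch. It assumes the parameters are stored as one `key\0value\0...` string and searched linearly; `StartupParams`, `new`, `iter`, and `get` are hypothetical names chosen for this example and are not taken from the proxy source.

```rust
/// Startup parameters kept in wire form: `key\0value\0key\0value\0`.
/// Hypothetical type for illustration only, not the proxy's actual API.
pub struct StartupParams {
    raw: String,
}

impl StartupParams {
    /// Store the raw wire-form buffer without parsing it into a map.
    pub fn new(raw: impl Into<String>) -> Self {
        Self { raw: raw.into() }
    }

    /// Iterate over (key, value) pairs as borrowed slices of the buffer.
    /// A trailing key without a value is silently dropped.
    pub fn iter(&self) -> impl Iterator<Item = (&str, &str)> + '_ {
        let mut parts = self.raw.split_terminator('\0');
        std::iter::from_fn(move || {
            let key = parts.next()?;
            let value = parts.next()?;
            Some((key, value))
        })
    }

    /// Linear search for a parameter; cheap when there are only a few entries.
    pub fn get(&self, name: &str) -> Option<&str> {
        self.iter().find(|(k, _)| *k == name).map(|(_, v)| v)
    }
}
```

With the buffer from the commit message, `get("database")` would borrow `neondb` directly from the stored string, so a lookup allocates nothing beyond the single buffer.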

580 Commits