Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.
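For illustration, a minimal sketch of what the derived impls look like; the field names here are made up, not the actual `remote_storage` schema:

```rust
use serde::Deserialize;

// Illustrative only: the field names are invented for the example and are not
// the actual remote_storage schema.
#[derive(Debug, Deserialize)]
pub struct S3Config {
    pub bucket_name: String,
    pub bucket_region: String,
    pub prefix_in_bucket: Option<String>,
}

#[derive(Debug, Deserialize)]
pub struct AzureConfig {
    pub container_name: String,
    pub container_region: String,
}

#[derive(Debug, Deserialize)]
pub struct RemoteStorageConfig {
    pub s3: Option<S3Config>,
    pub azure: Option<AzureConfig>,
}
```

A deserialization test can then be a one-liner, e.g. parsing a TOML fragment with `toml::from_str::<RemoteStorageConfig>(input)`.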
The git history of this PR contains a state where we went via an
intermediate representation, leveraging the `serde_json` crate, without
the data ever being actual JSON, though.
Also, the PR adds deserialization tests.
Alternative to #7743 .
## Problem
Some tasks are using upwards of 10KB of memory at all times, and
sometimes have buffers that swing them up to 30MB.
## Summary of changes
Split some of the async tasks in selected places and box them as
appropriate to try to reduce the constant memory usage, especially
where the large future is only a small part of the task's total
runtime.
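As an illustration of the boxing pattern (hypothetical functions, not the actual call sites):

```rust
use tokio::net::TcpStream;

// Hypothetical stand-ins for the real proxy code: `handshake` has a large
// future (lots of locals and buffers) but runs once; `copy_loop` has small
// state and runs for the lifetime of the connection.
async fn handshake(socket: TcpStream) -> TcpStream {
    // ... imagine several KB of state here ...
    socket
}

async fn copy_loop(_socket: TcpStream) {
    // long-running, small-state loop
}

async fn handle_connection(socket: TcpStream) {
    // Without the Box::pin, the handshake's state would be embedded in the
    // task future for its entire lifetime; boxing moves it to the heap and
    // frees it as soon as the handshake completes.
    let socket = Box::pin(handshake(socket)).await;
    copy_loop(socket).await;
}
```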
Also reduces the CopyBuffer size from 8KB to 1KB.
In my local testing and in staging this gave a minor improvement, sadly
not the improvement I was hoping for :/ It might have more impact in
production.
## Problem
Rust 1.79 enables new lints by default.
## Summary of changes
* update to rust 1.79
* `s/default_features/default-features/`
* fix proxy dead code.
* fix pageserver dead code.
Closes #7406.
## Problem
When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
used to handle that case, which logs the error verbosely. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.
## Summary of changes
This PR introduces a new `ApiError` variant to handle errors caused by
cancelled requests more robustly.
- A new enum variant `ApiError::Cancelled` (a minimal sketch follows this list)
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate cancelled `get_lsn_by_timestamp`
request.
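Roughly what the new variant and its status-code mapping look like; this is a sketch with simplified names, not the exact code:

```rust
use hyper::StatusCode;

// Sketch only: the real ApiError has many more variants.
pub enum ApiError {
    /// The request was cancelled (e.g. the client went away).
    Cancelled,
    InternalServerError(anyhow::Error),
}

impl ApiError {
    pub fn status_code(&self) -> StatusCode {
        match self {
            // For now a cancelled request is still mapped to a 500, but
            // without carrying an anyhow backtrace and the verbose logging.
            ApiError::Cancelled => StatusCode::INTERNAL_SERVER_ERROR,
            ApiError::InternalServerError(_) => StatusCode::INTERNAL_SERVER_ERROR,
        }
    }
}
```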
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
There was a bug in the dynamic rate limiter which exhausted CPU in
proxy, so proxy wasn't able to accept any connections.
## Summary of changes
1. `if self.available > 1` -> `if self.available >= 1`, so the last remaining token can actually be claimed (see the sketch below).
2. Remove `timeout_at` and use a plain `timeout` instead.
3. Remove potential infinite loops which could exhaust CPUs.
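A minimal sketch of the off-by-one in point 1, with made-up names rather than the actual limiter code:

```rust
// Hypothetical token-bucket check. With `available == 1`, the old condition
// `available > 1` refused the request even though one token was free, so a
// limiter configured with a capacity of 1 could never admit anything.
struct Limiter {
    available: u32,
}

impl Limiter {
    fn try_acquire(&mut self) -> bool {
        if self.available >= 1 {
            self.available -= 1;
            true
        } else {
            false
        }
    }
}
```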
## Problem
Proxy params being a `HashMap<String, String>` when they contain just
```
application_name: psql
database: neondb
user: neondb_owner
```
is quite wasteful allocation-wise.
## Summary of changes
Keep the params in wire-protocol form, e.g.:
```
application_name\0psql\0database\0neondb\0user\0neondb_owner\0
```
Using a linear search for the map is fast enough at small sizes, which
is the normal case.
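A rough sketch of the idea with hypothetical names (the real implementation differs):

```rust
/// Startup params kept in wire form: key and value pairs, each NUL-terminated,
/// e.g. b"application_name\0psql\0database\0neondb\0user\0neondb_owner\0".
pub struct StartupParams {
    raw: Vec<u8>,
}

impl StartupParams {
    /// Linear scan over the key/value pairs; with three or four params this is
    /// as fast as a HashMap lookup and avoids per-entry String allocations.
    pub fn get(&self, key: &str) -> Option<&str> {
        let mut parts = self.raw.split(|&b| b == 0);
        while let Some(k) = parts.next() {
            let v = parts.next()?;
            if k == key.as_bytes() {
                return std::str::from_utf8(v).ok();
            }
        }
        None
    }
}
```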
## Problem
We were rate limiting wake_compute in the wrong place
## Summary of changes
Move the `wake_compute` rate limit to after the permit is acquired. Also
slightly refactors `normalize`, as it caught my eye.
## Problem
Computes that are healthy can manage many connection attempts at a time.
Unhealthy computes cannot. We initially handled this with a fixed
concurrency limit, but it seems this inhibits pgbench.
## Summary of changes
Support AIMD for the `connect_to_compute` lock, to allow varying the
concurrency limit based on compute health.
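A minimal sketch of the AIMD rule (additive increase on success, multiplicative decrease on failure); the constants and names are illustrative, not what the PR ships:

```rust
// Hypothetical AIMD controller for the connect_to_compute concurrency limit.
struct AimdLimit {
    limit: f64,
    min: f64,
    max: f64,
}

impl AimdLimit {
    /// Connection attempt succeeded: grow the limit additively.
    fn on_success(&mut self) {
        self.limit = (self.limit + 1.0).min(self.max);
    }

    /// Connection attempt failed or timed out: shrink multiplicatively,
    /// backing off quickly when the compute looks unhealthy.
    fn on_failure(&mut self) {
        self.limit = (self.limit * 0.5).max(self.min);
    }

    fn current(&self) -> usize {
        self.limit as usize
    }
}
```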
Get rid of postgres-native-tls and openssl in favour of rustls in our
dependency tree.
Take further steps to completely remove native-tls and openssl.
Among other advantages, this allows us to do static musl builds more
easily: #7889
## Problem
It seems the websocket buffering was broken, but only for large query responses.
## Summary of changes
Move buffering to after the underlying stream is ready.
Local testing confirms this fixes the bug.
Also fixes the pg-sni-router missing-metrics bug.
## Problem
Despite making password hashing async, it can still take time away from
the network code.
## Summary of changes
Introduce a custom threadpool, inspired by rayon. Features:
### Fairness
Each task is tagged with its endpoint ID. The more times we have seen
the endpoint, the more likely we are to skip the task if it comes up in
the queue. This uses a count-min sketch to estimate the number of
times we have seen the endpoint, resetting it every 1000+ steps.
Since tasks are immediately rescheduled if they do not complete, the
worker could get stuck in an "always work available" loop. To combat
this, we check the global queue every 61 steps to ensure all tasks
quickly get a worker assigned to them.
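For illustration, the count-min sketch idea in miniature; the dimensions and hashing below are made up, not the pool's actual parameters:

```rust
use std::hash::{Hash, Hasher};

// Tiny count-min sketch: several rows of counters, each indexed by an
// independent hash of the key; the estimate is the minimum across rows,
// so collisions can only over-count, never under-count.
const ROWS: usize = 4;
const COLS: usize = 64;

struct CountMinSketch {
    counters: [[u32; COLS]; ROWS],
}

impl CountMinSketch {
    fn new() -> Self {
        Self { counters: [[0; COLS]; ROWS] }
    }

    fn index<K: Hash>(key: &K, row: usize) -> usize {
        let mut h = std::collections::hash_map::DefaultHasher::new();
        row.hash(&mut h); // salt per row to get "independent" hashes
        key.hash(&mut h);
        (h.finish() as usize) % COLS
    }

    fn observe<K: Hash>(&mut self, key: &K) {
        for row in 0..ROWS {
            self.counters[row][Self::index(key, row)] += 1;
        }
    }

    fn estimate<K: Hash>(&self, key: &K) -> u32 {
        (0..ROWS)
            .map(|row| self.counters[row][Self::index(key, row)])
            .min()
            .unwrap()
    }

    /// Periodic reset keeps the estimates reflecting recent traffic only.
    fn reset(&mut self) {
        self.counters = [[0; COLS]; ROWS];
    }
}
```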
### Balanced
Using crossbeam_deque, like rayon does, we get work-stealing out of the
box. I've tested it a fair amount and it seems to balance the workload
well.
## Summary of changes
Updates the parquet lib. One change we still need is in an open PR
against upstream; hopefully we can remove the git dependency by 52.0.0:
https://github.com/apache/arrow-rs/pull/5773
I'm not sure why the parquet files got a little bit bigger. I tested
them and they still open fine. 🤷
As a side effect of the update, chrono was updated and added yet another
deprecation warning (hence the safekeepers change).
## Problem
I wanted to do a deep dive into the tungstenite codebase.
tokio-tungstenite is incredibly convoluted... In my searching I found
[fastwebsockets by deno](https://github.com/denoland/fastwebsockets),
but it wasn't quite sufficient.
This also removes the default 16MB/64MB frame/message size limitation.
framed-websockets solves this by inserting continuation frames for
partially received messages, so the whole message does not need to be
entirely read into memory.
## Summary of changes
I took the fastwebsockets code as a starting point and rewrote it to
be simpler, server-only, and poll-based to support our Read/Write
wrappers.
I have replaced our tungstenite code with my framed-websockets fork.
<https://github.com/neondatabase/framed-websockets>
## Problem
There is no global per-ep rate limiter in proxy.
## Summary of changes
* Bring the global per-ep rate limiter back.
* Rename the weak compute rate limiter (the CLI flags were not used
anywhere, so it's safe to rename).
## Problem
If a permit cannot be acquired to connect to compute, the cache is
invalidated. This had the observed effect of sending more traffic to
ProxyWakeCompute on cplane.
## Summary of changes
Make sure that permit acquire failures are marked as "should not
invalidate cache".
## Problem
Some HTTP client connections can stay open for quite a long time.
## Summary of changes
When there are too many HTTP client connections, pick a random
connection and gracefully cancel it.
## Problem
Usually, the connection itself is quite fast (below 10ms for p999:
https://neonprod.grafana.net/goto/aOyn8vYIg?orgId=1).
It doesn't make much sense to wait a long time for the lock: if
acquiring it takes a long time, something has probably gone wrong.
We also spawn a lot of retries, but they are not very helpful (0 means
the connection succeeded; 1 most probably means it was a re-request of
the compute node address
https://neonprod.grafana.net/goto/J_8VQvLIR?orgId=1). Let's try to keep
the number of retries small.
## Problem
Too many connect_compute attempts can overwhelm postgres, getting the
connections stuck.
## Summary of changes
Limit the number of connection attempts that can happen at any given time.
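A minimal sketch of the mechanism with a plain semaphore; the names and granularity are illustrative, not the actual proxy code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical: cap the number of concurrent connect attempts by sharing one
// semaphore among tasks connecting to the same compute.
async fn connect_with_limit(limiter: Arc<Semaphore>) -> anyhow::Result<()> {
    // Wait for a permit; while we hold it, at most N attempts run at once.
    let _permit = limiter.acquire().await?;
    // ... perform the actual connection attempt here ...
    Ok(())
}
```

Usage would be something like `let limiter = Arc::new(Semaphore::new(8));` shared across the connecting tasks.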
## Problem
It's not possible to get the duration of the session from proxy events.
## Summary of changes
* Added a separate events folder in S3 to record disconnect events.
* Disconnect events are exactly the same as normal events, but with a
non-empty `disconnect_timestamp` field.
* @oruen suggested filling them with the same information as the original
events to avoid potentially heavy joins.
## Problem
Start switching from the global redis to the regional one
## Summary of changes
* Publish cancellations to the regional redis.
* Listen for notifications from both the global and the regional instances.
## Problem
The `WithClientIp` AsyncRead/Write abstraction never filled me with much
joy. I would just rather read the protocol header once and then get the
remaining buf and reader.
## Summary of changes
* Replace `WithClientIp::wait_for_addr` with `read_proxy_protocol`.
* Replace `WithClientIp` with `ChainRW`.
* Optimise `ChainRW` so that the common path is cheap (sketched below).
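A rough sketch of the `ChainRW` read path, not the actual implementation: leftover bytes read past the proxy-protocol header are served first, after which every read goes straight to the inner stream:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, ReadBuf};

// Hypothetical simplified version: `buf` holds the bytes we read past the
// proxy-protocol header; once drained, reads pass through untouched.
pub struct ChainRW<R> {
    buf: Vec<u8>,
    pos: usize,
    inner: R,
}

impl<R: AsyncRead + Unpin> AsyncRead for ChainRW<R> {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        out: &mut ReadBuf<'_>,
    ) -> Poll<std::io::Result<()>> {
        let this = self.get_mut();
        if this.pos < this.buf.len() {
            // Serve leftover header bytes first.
            let n = out.remaining().min(this.buf.len() - this.pos);
            out.put_slice(&this.buf[this.pos..this.pos + n]);
            this.pos += n;
            return Poll::Ready(Ok(()));
        }
        // Common path: forward directly to the wrapped stream.
        Pin::new(&mut this.inner).poll_read(cx, out)
    }
}
```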
Currently we move data to the intended storage class via lifecycle
rules, but those are a daily batch job so data first spends up to a day
in standard storage.
Therefore, make it possible to specify the storage class used for
uploads to S3 so that the data doesn't have to be migrated
automatically.
The advantage of this is that it gives cleaner billing reports.
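For illustration, setting a storage class on an upload with a recent aws-sdk-s3 looks roughly like this; the actual config option and plumbing in this PR are not shown:

```rust
use aws_sdk_s3::{primitives::ByteStream, types::StorageClass, Client};

// Hypothetical helper: upload directly into the intended storage class so the
// object never has to wait for a lifecycle rule to move it.
async fn upload_with_class(
    client: &Client,
    bucket: &str,
    key: &str,
    body: Vec<u8>,
) -> anyhow::Result<()> {
    client
        .put_object()
        .bucket(bucket)
        .key(key)
        .body(ByteStream::from(body))
        .storage_class(StorageClass::IntelligentTiering)
        .send()
        .await?;
    Ok(())
}
```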
Part of https://github.com/neondatabase/cloud/issues/11348
## Problem
Currently we cannot configure retries, and we don't really have
visibility into what's going on there.
## Summary of changes
* Added CLI params.
* Improved logging.
* Decreased the number of retries: it feels like most retries don't
help. Once error handling improves, we can increase it
again.
## Problem
Sometimes the `rejected` metric might record invalid events.
## Summary of changes
* Only record it if `rejected` was explicitly set.
* Change the order in logs.
* Report metrics only when not under high load.
## Problem
There is some unused dead code.
## Summary of changes
Let's remove it. If we need it again in the future, we can always bring
it back.
Also removed the CLI arguments; they shouldn't be used by anyone but us.
## Problem
Many users have access to IPv6 subnets (e.g. a /64). That gives them 2^64
addresses to play with.
## Summary of changes
Truncate the address to /64 to reduce the attack surface.
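A minimal sketch of the truncation (the helper name is made up):

```rust
use std::net::Ipv6Addr;

/// Keep only the upper 64 bits (the routing prefix); zero the interface
/// identifier so every address in the same /64 maps to the same key.
fn truncate_to_64(addr: Ipv6Addr) -> Ipv6Addr {
    let masked = u128::from(addr) & 0xffff_ffff_ffff_ffff_0000_0000_0000_0000;
    Ipv6Addr::from(masked)
}
```

For example, every address in `2001:db8:1:2::/64` maps to `2001:db8:1:2::`.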
Todo:
~~Will NAT64 be an issue here? AFAIU they put the IPv4 address at the
end of the IPv6 address. By truncating we will lose all that detail.~~
It's the same problem as a host sharing IPv6 addresses between clients.
I don't think it's up to us to solve. If a customer is getting DDoSed,
then they likely need to arrange a dedicated IP with us.
## Problem
It is possible for database connections to not close in time.
## Summary of changes
Force connections to close if the client has hung up.
## Problem
Actually read redis events.
## Summary of changes
This is a revert of https://github.com/neondatabase/neon/pull/7350, plus
fixes.
* Fixed events parsing
* Added timeout after connection failure
* Separated regional and global redis clients.
## Problem
My benchmarks show that the prometheus crate is not very good.
https://github.com/conradludgate/measured
We're already using `measured` in storage_controller and it seems to be
working well.
## Summary of changes
Replace prometheus with my new measured crate in proxy only.
Apologies for the large diff. I tried to keep it as minimal as I could.
The label types add a bit of boilerplate (but reduce the chance we
mistype the labels), and some of our custom metrics like CounterPair and
HLL needed to be rewritten.
## Problem
hyper1 offers control over the HTTP connection that hyper0_14 does not.
We're blocked on switching all services to hyper1 because of how we use
tonic, but there's no reason we can't switch proxy over.
## Summary of changes
1. hyper0.14 -> hyper1
    1. self managed server
    2. Remove the `WithConnectionGuard` wrapper from `protocol2`
2. Remove TLS listener as it's no longer necessary
3. include first session ID in connection startup logs
## Problem
Proxy doesn't know about existing endpoints.
## Summary of changes
* Added caching of all available endpoints.
* Under high load, check it before going to cplane.
* Report metrics for the outcome.
* For the rate limiter and credentials caching, don't distinguish between
`-pooled` and non-pooled endpoints.
TODOs:
* Make metrics more meaningful
* Consider integrating it with the endpoint rate limiter
* Test it together with cplane in preview
## Problem
Would be nice to have a bit more info on cold start metrics.
## Summary of changes
* Change connect compute latency to include `cold_start_info`.
* Update `ColdStartInfo` to include HttpPoolHit and WarmCached.
* Several changes to make more use of interned strings