This PR concludes the "async `Layer::get_value_reconstruct_data`" project.

The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO goes through VirtualFile / std::fs and is therefore blocking. This results in unfairness towards other tokio tasks, especially under (disk) load.

Some context can be found at https://github.com/neondatabase/neon/issues/4154, where I suspect (but can't prove) that load spikes from logical size calculation cause heavy eviction skew. Sadly, we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know that blocking the executor threads on std::fs IO is bad. So, let's make this change and watch out for severe perf regressions in staging & during rollout.

## Changes

* Rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking`.
* Add a new blanket-impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`; see the first sketch at the end of this description.
* `spawn_blocking` requires a `'static` lifetime for the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of `get_value_reconstruct_data` instead of passing a reference. It's a small struct, so I don't expect a big performance penalty.

## Performance

Fundamentally, the code changes cause the following performance-relevant changes:

* Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation, because `async_trait` is just sugar for boxed futures under the hood.
* Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool.
  * Using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under the hood (illustrated in the second sketch at the end of this description).
* Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem, because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the throughput gains and the expected better fairness during disk load peaks, such as the logical size calculation peaks uncovered in #4154.

## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs:

1. Clean-ups:
   * https://github.com/neondatabase/neon/pull/4316
   * https://github.com/neondatabase/neon/pull/4317
   * https://github.com/neondatabase/neon/pull/4318
   * https://github.com/neondatabase/neon/pull/4319
   * https://github.com/neondatabase/neon/pull/4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need for my original plan of converting `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these clean-ups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485
7. https://github.com/neondatabase/neon/pull/4441
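## Appendix: Sketches

As referenced in the Changes section, here is a rough sketch of the trait shape after this change. It is not the actual pageserver code: the type definitions are simplified stand-ins, and the `Arc<Self>` receiver is merely one illustrative way to satisfy `spawn_blocking`'s `'static` bound.

```rust
use std::sync::Arc;

use anyhow::Result;
use async_trait::async_trait;

// Simplified stand-ins for the real pageserver types.
pub struct Key(pub u64);
pub struct Lsn(pub u64);

#[derive(Default)]
pub struct ValueReconstructState {
    pub records: Vec<(Lsn, Vec<u8>)>,
    pub img: Option<(Lsn, Vec<u8>)>,
}

pub enum ValueReconstructResult {
    Complete,
    Continue,
    Missing,
}

#[async_trait]
pub trait Layer: Send + Sync + 'static {
    /// The old synchronous implementation, renamed. It may block on
    /// VirtualFile / std::fs IO, so it must not run on executor threads.
    fn get_value_reconstruct_data_blocking(
        &self,
        key: Key,
        lsn: Lsn,
        reconstruct_state: ValueReconstructState,
    ) -> Result<(ValueReconstructState, ValueReconstructResult)>;

    /// Blanket-provided async wrapper: the state is moved into the
    /// `spawn_blocking` closure and moved back out via the return value,
    /// which satisfies the `'static` bound without sharing references.
    async fn get_value_reconstruct_data(
        self: Arc<Self>,
        key: Key,
        lsn: Lsn,
        reconstruct_state: ValueReconstructState,
    ) -> Result<(ValueReconstructState, ValueReconstructResult)> {
        tokio::task::spawn_blocking(move || {
            self.get_value_reconstruct_data_blocking(key, lsn, reconstruct_state)
        })
        .await
        .expect("spawn_blocking task panicked")
    }
}
```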
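The second sketch illustrates the tokio::fs point from the Performance section. These helpers are hypothetical and not part of this PR: they only contrast a single executor-to-blocking-pool handoff around existing synchronous IO with one implicit handoff per tokio::fs operation.

```rust
use std::io::Read;
use std::path::PathBuf;

use tokio::io::AsyncReadExt;

/// Variant A: keep the existing synchronous code and pay for a single
/// executor -> blocking-pool handoff, no matter how many IO calls happen inside.
async fn read_whole_file_spawn_blocking(path: PathBuf) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::File::open(&path)?;
        let len = file.metadata()?.len() as usize;
        let mut buf = vec![0u8; len];
        // Any number of seeks/reads here stays on the same blocking-pool thread.
        file.read_exact(&mut buf)?;
        Ok(buf)
    })
    .await
    .expect("blocking read task panicked")
}

/// Variant B: port the code to tokio::fs. The open(), metadata(), and read
/// below each do their own blocking-pool handoff internally.
async fn read_whole_file_tokio_fs(path: PathBuf) -> std::io::Result<Vec<u8>> {
    let mut file = tokio::fs::File::open(&path).await?;
    let len = file.metadata().await?.len() as usize;
    let mut buf = vec![0u8; len];
    file.read_exact(&mut buf).await?;
    Ok(buf)
}
```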