mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-30 03:20:36 +00:00
dfcbb139fb7a0e287dca4ff04d650d048d04f002
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
49d5e56c08 |
pageserver: use direct IO for delta and image layer reads (#9326)
Part of #8130 ## Problem Pageserver previously goes through the kernel page cache for all the IOs. The kernel page cache makes light-loaded pageserver have deceptive fast performance. Using direct IO would offer predictable latencies of our virtual file IO operations. In particular for reads, the data pages also have an extremely low temporal locality because the most frequently accessed pages are cached on the compute side. ## Summary of changes This PR enables pageserver to use direct IO for delta layer and image layer reads. We can ship them separately because these layers are write-once, read-many, so we will not be mixing buffered IO with direct IO. - implement `IoBufferMut`, an buffer type with aligned allocation (currently set to 512). - use `IoBufferMut` at all places we are doing reads on image + delta layers. - leverage Rust type system and use `IoBufAlignedMut` marker trait to guarantee that the input buffers for the IO operations are aligned. - page cache allocation is also made aligned. _* in-memory layer reads and the write path will be shipped separately._ ## Testing Integration test suite run with O_DIRECT enabled: https://github.com/neondatabase/neon/pull/9350 ## Performance We evaluated performance based on the `get-page-at-latest-lsn` benchmark. The results demonstrate a decrease in the number of IOps, no sigificant change in the latency mean, and an slight improvement on the p99.9 and p99.99 latencies. [Benchmark](https://www.notion.so/neondatabase/Benchmark-O_DIRECT-for-image-and-delta-layers-2024-10-01-112f189e00478092a195ea5a0137e706?pvs=4) ## Rollout We will add `virtual_file_io_mode=direct` region by region to enable direct IO on image + delta layers. Signed-off-by: Yuchen Liang <yuchen@neon.tech> |
||
|
|
9627747d35 |
bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537)
Part of [Epic: Bypass PageCache for user data blocks](https://github.com/neondatabase/neon/issues/7386). # Problem `InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`. # Background Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`. The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offset, we find a varint length followed by the serialized `Value`. Vectored reads (`get_values_reconstruct_data`) are not in fact vectored - each `Value` that needs to be read is read sequentially. The `will_init` bit of information which we use to early-exit the `get_values_reconstruct_data` for a given key is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`. The L0 flushing **also** needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush. # Changes 1. Store the value length and `will_init` information in the `InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the values. 2. For `get_values_reconstruct_data`: - Use the in-memory `index` figures out which values need to be read. Having the `will_init` stored in the index enables us to do that. - View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO. - Figure out which chunks need to be read to retrieve the serialized bytes for thes values we need to read. - Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk. - Merge adjacent chunk reads into larger `EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable constant). 3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer. 4. The L0 flush code is changed to use the `index` directly, `blob_io` 5. We can remove the `ephemeral_file::page_caching` construct now. The `get_values_reconstruct_data` changes seem like a bit overkill but they are necessary so we issue the equivalent amount of read system calls compared to before this PR where it was highly likely that even if the first PageCache access was a miss, remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit. The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on [direct IO](https://github.com/neondatabase/neon/issues/8130) and https://github.com/neondatabase/neon/issues/8719 specifically, we need to do _something_ like this anyways in the near future. # Alternative Design The original plan was to use the `vectored_blob_io` code it relies on the invariant of Delta&Image layers that `index order == values order`. Further, `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done, specifically, if multiple reads map to the same "DIO chunk" (=alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer. => https://github.com/neondatabase/neon/issues/8719 # Testing / Performance Correctness of the IO merging code is ensured by unit tests. Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`. Performance testing results are presented below. All pref testing done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`. We see definitive improvement in ingest performance microbenchmark and an ad-hoc microbenchmark for getpage against InMemoryLayer. ``` baseline: commit |