Vectored Timeline Get
Created on: 2024-01-02 Author: Christian Schwarz
Summary
A brief RFC / GitHub Epic describing a vectored version of the Timeline::get method that is at the heart of Pageserver.
EDIT: the implementation of this feature is described in Vlad's (internal) tech talk.
Motivation
During basebackup, we issue many Timeline::get calls for SLRU pages that are adjacent in key space.
For an example, see
5c88213eaf/pageserver/src/basebackup.rs (L281-L302).
Each of these Timeline::get calls must traverse the layer map to gather reconstruct data (Timeline::get_reconstruct_data) for the requested page number (blknum in the example).
For each layer visited by layer map traversal, we do a DiskBtree point lookup.
If it's negative (no entry), we resume layer map traversal.
If it's positive, we collect the result in our reconstruct data bag.
If the reconstruct data bag contents suffice to reconstruct the page, we're done with get_reconstruct_data and move on to walredo.
Otherwise, we resume layer map traversal.
Doing this many Timeline::get calls is quite inefficient because:
- We do the layer map traversal repeatedly, even if, e.g., all the data sits in the same image layer at the bottom of the stack.
- We may visit many DiskBtree inner pages multiple times for point lookup of different keys. This is likely particularly bad for L0s which span the whole key space and hence must be visited by layer map traversal, but may not contain the data we're looking for.
- Anecdotally, keys adjacent in keyspace and written simultaneously also end up physically adjacent in the layer files. So, to provide the reconstruct data for N adjacent keys, we would actually only need to issue a single large read to the filesystem, instead of the N reads we currently do. The filesystem, in turn, ideally stores the layer file physically contiguously, so our large read will turn into one IOP toward the disk.
Solution
We should have a vectored aka batched aka scatter-gather style alternative API for Timeline::get. Having such an API unlocks:
- more efficient basebackup
- batched IO during compaction (useful for strides of unchanged pages)
- page_service: expose vectored get_page_at_lsn for compute (=> good for seqscan / prefetch)
- if on-demand SLRU downloads land before vectored Timeline::get, on-demand SLRU downloads will still benefit from this API
DoD
There is a new variant of `Timeline::get`, called `Timeline::get_vectored`.
It takes as arguments an `lsn: Lsn` and a `src: &[KeyVec]`, where `struct KeyVec { base: Key, count: usize }`.
It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
It is sufficient to simply return a `Vec<Bytes>`, but more efficient solutions can likely be found after studying all the callers of `Timeline::get`.
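For illustration, a minimal sketch of that shape. The `KeyVec` struct matches the description above; the argument order, return type, and error type are placeholders, not the final API:

```rust
use bytes::Bytes;

/// A contiguous run of `count` keys starting at `base`.
pub struct KeyVec {
    pub base: Key,
    pub count: usize,
}

impl Timeline {
    /// Placeholder signature: one reconstructed page image per requested key,
    /// in request order. `Vec<Bytes>` is merely the simplest possible return
    /// type; a streaming return type is a conceivable alternative.
    pub async fn get_vectored(
        &self,
        lsn: Lsn,
        src: &[KeyVec],
    ) -> anyhow::Result<Vec<Bytes>> {
        todo!("see the equivalence pseudocode below")
    }
}
```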
Functionally, the behavior of `Timeline::get_vectored` is equivalent to:

```rust
// Pseudocode: `timeline: &Timeline`, `src: &[KeyVec]`, and `lsn: Lsn` are in scope.
// Semantically, get_vectored is one Timeline::get per requested key, in order.
let keys_iter = src
    .iter()
    .flat_map(|KeyVec { base, count }| (0..*count).map(move |i| base.add(i as u32)));
let mut out = Vec::new();
for key in keys_iter {
    let data = timeline.get(key, lsn)?;
    out.push(data);
}
return out;
```
However, unlike above, an ideal solution will
- Visit each `struct Layer` at most once.
- For each visited layer, call `Layer::get_value_reconstruct_data` at most once.
  - This means, read each `DiskBtree` page at most once.
- Facilitate merging of the reads we issue to the OS and eventually NVMe.
Each of these items above represents a significant amount of work.
Performance
Ideally, the base performance of a vectored get of a single page should be identical to the current Timeline::get.
A reasonable constant overhead over current Timeline::get is acceptable.
The performance improvement for the vectored use case is demonstrated in some way, e.g., using the pagebench basebackup benchmark against a tenant with a lot of SLRU segments.
Implementation
High-level set of tasks / changes to be made:
- Get clarity on API:
  - Define a naive `Timeline::get_vectored` implementation & adopt it across pageserver.
  - The tricky thing here will be the return type (e.g. `Vec<Bytes>` vs `impl Stream`).
  - Start with something simple to explore the different usages of the API. Then iterate with peers until we have something that is good enough.
- Vectored Layer Map traversal
  - Vectored `LayerMap::search` (take 1 LSN and N `Key`s instead of just 1 LSN and 1 `Key`)
  - Refactor `Timeline::get_reconstruct_data` to hold & return state for N `Key`s instead of 1
    - The slightly tricky part here is what to do about `cont_lsn` after we've found some reconstruct data for some keys but need more. Likely we'll need to keep track of `cont_lsn` per key and continue the next iteration at `max(cont_lsn)` of all keys that still need data (see the sketch at the end of this section).
- Vectored `Layer::get_value_reconstruct_data` / `DiskBtree`
  - Current code calls it here.
  - Delta layers use `DiskBtreeReader::visit()` to collect the `(offset, len)` pairs for delta record blobs to load.
  - Image layers use `DiskBtreeReader::get` to get the offset of the image blob to load. Underneath, that's just a `::visit()` call.
  - What needs to happen to `DiskBtree::visit()`?
    - Minimally:
      - Take a single `KeyVec` instead of a single `Key` as argument, i.e., take a single contiguous key range to visit.
      - Change the visit code to invoke the callback for all values in the `KeyVec`'s key range.
      - This should be good enough for what we've seen when investigating basebackup slowness, because there, the key ranges are contiguous.
    - Ideally:
      - Take a `&[KeyVec]` and sort it;
      - during Btree traversal, peek at the next `KeyVec` range to determine whether we need to descend or back out.
      - NB: this should be a straight-forward extension of the minimal solution above, as we'll already be checking for "is there more key range in the requested `KeyVec`".
- Facilitate merging of the reads we issue to the OS and eventually NVMe (a read-planning sketch follows right after this list).
  - The `DiskBtree::visit` produces a set of offsets which we then read from a `VirtualFile` here:
    - Delta layer reads
      - We hit (and rely) on `PageCache` and `VirtualFile` here (not great under pressure)
    - Image layer reads
  - What needs to happen is the vectorization of the `blob_io` interface and then the `VirtualFile` API.
  - That is tricky because
    - the `VirtualFile` API, which sits underneath `blob_io`, is being touched by ongoing io_uring work
    - there's the question of how IO buffers will be managed; currently this area relies heavily on `PageCache`, but there's controversy around the future of `PageCache`.
      - The guiding principle here should be to avoid coupling this work to the `PageCache`.
      - I.e., treat `PageCache` as an extra hop in the I/O chain, rather than as an integral part of buffer management.
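As a rough illustration of the read-merging item, here is a hedged sketch (a hypothetical free function, not existing pageserver code) of how the `(offset, len)` pairs produced by the index lookups could be coalesced into fewer, larger reads before they are issued to `VirtualFile` / the OS:

```rust
/// Coalesce byte ranges whose gaps are at most `max_gap` into larger reads.
/// Input: (offset, len) pairs, e.g. blob locations collected by the DiskBtree
/// visit. Output: the merged reads to issue.
fn plan_reads(mut blobs: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    blobs.sort_by_key(|&(offset, _)| offset);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (offset, len) in blobs {
        match merged.last_mut() {
            // Close enough to the previous read: extend it to cover this blob too.
            Some((m_offset, m_len)) if offset <= *m_offset + *m_len + max_gap => {
                *m_len = (offset + len).max(*m_offset + *m_len) - *m_offset;
            }
            // Otherwise start a new read.
            _ => merged.push((offset, len)),
        }
    }
    merged
}
```

Where such planning lives (in `blob_io`, in `VirtualFile`, or elsewhere) is exactly the open question raised in the last item above.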
Let's see how much we can improve by doing the first three items in the above list first, then revisit.
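For the vectored layer map traversal item in particular, here is a hedged sketch of the per-key bookkeeping described above (type and field names are hypothetical, not the actual `Timeline::get_reconstruct_data` code):

```rust
use std::collections::HashMap;

/// Traversal state for N keys instead of 1: which keys still need more
/// reconstruct data, and at which LSN each key's search must continue.
struct VectoredReconstructState {
    cont_lsn_by_key: HashMap<Key, Lsn>,
}

impl VectoredReconstructState {
    /// The next layer map traversal iteration resumes at max(cont_lsn) over
    /// all keys that still need data, so that no key's history is skipped.
    fn next_cont_lsn(&self) -> Option<Lsn> {
        self.cont_lsn_by_key.values().copied().max()
    }

    /// A key is dropped from the state once its reconstruct data is complete.
    fn mark_complete(&mut self, key: &Key) {
        self.cont_lsn_by_key.remove(key);
    }
}
```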
Rollout / Feature Flags
No feature flags are required for this epic.
At the end of this epic, Timeline::get forwards to Timeline::get_vectored, i.e., it's an all-or-nothing type of change.
It is encouraged to deliver this feature incrementally, i.e., do many small PRs over multiple weeks. That will help isolate performance regressions across weekly releases.
Interaction With Sharding
Sharding splits up the key space, see functions is_key_local / key_to_shard_number.
Just as with Timeline::get, callers of Timeline::get_vectored are responsible for ensuring that they only ask for blocks of the given struct Timeline's shard.
Given that this is already the case, there shouldn't be significant interaction/interference with sharding.
However, let's have a safety check for this constraint (error or assertion) because there are currently few affordances at the higher layers of Pageserver for sharding<=>keyspace interaction.
For example, `KeySpace` is not broken up by shard stripe, so if someone naively converted the compaction code to issue a vectored get for a whole keyspace range, it would violate this constraint.
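A hedged sketch of such a check, assuming a shard-identity helper along the lines of the `is_key_local` function mentioned above (exact names, field access, and error handling are illustrative):

```rust
// Inside a hypothetical Timeline::get_vectored: reject requests for keys
// that belong to a different shard of this tenant.
for key_vec in src {
    for i in 0..key_vec.count {
        let key = key_vec.base.add(i as u32);
        if !self.shard_identity.is_key_local(&key) {
            anyhow::bail!("get_vectored called with key {key} that is not local to this shard");
        }
    }
}
```

An assertion instead of a hard error would equally satisfy the constraint described above.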