In complement to
https://github.com/neondatabase/tokio-epoll-uring/pull/56.
## Problem
We want to make tokio-epoll-uring slots waiters queue depth observable
via Prometheus.
## Summary of changes
- Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth`
metrics as a `Histogram`.
- Each thread-local tokio-epoll-uring system is given a `LocalHistogram`
to observe the metrics.
- Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data
to the shared histogram.
- Extend `Collector::collect` to report
`pageserver_tokio_epoll_uring_slots_submission_queue_depth`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Part of #8130
## Problem
Pageserver previously goes through the kernel page cache for all the
IOs. The kernel page cache makes light-loaded pageserver have deceptive
fast performance. Using direct IO would offer predictable latencies of
our virtual file IO operations.
In particular for reads, the data pages also have an extremely low
temporal locality because the most frequently accessed pages are cached
on the compute side.
## Summary of changes
This PR enables pageserver to use direct IO for delta layer and image
layer reads. We can ship them separately because these layers are
write-once, read-many, so we will not be mixing buffered IO with direct
IO.
- implement `IoBufferMut`, an buffer type with aligned allocation
(currently set to 512).
- use `IoBufferMut` at all places we are doing reads on image + delta
layers.
- leverage Rust type system and use `IoBufAlignedMut` marker trait to
guarantee that the input buffers for the IO operations are aligned.
- page cache allocation is also made aligned.
_* in-memory layer reads and the write path will be shipped separately._
## Testing
Integration test suite run with O_DIRECT enabled:
https://github.com/neondatabase/neon/pull/9350
## Performance
We evaluated performance based on the `get-page-at-latest-lsn`
benchmark. The results demonstrate a decrease in the number of IOps, no
sigificant change in the latency mean, and an slight improvement on the
p99.9 and p99.99 latencies.
[Benchmark](https://www.notion.so/neondatabase/Benchmark-O_DIRECT-for-image-and-delta-layers-2024-10-01-112f189e00478092a195ea5a0137e706?pvs=4)
## Rollout
We will add `virtual_file_io_mode=direct` region by region to enable
direct IO on image + delta layers.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
This PR simplifies the pageserver configuration parsing as follows:
* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
* use `serde(default)` to fill in default values
* in `pageserver` crate:
* use `toml_edit` to deserialize the pageserver.toml string into a
`ConfigToml`
* `PageServerConfig::parse_and_validate` then
* consumes the `ConfigToml`
* destructures it exhaustively into its constituent fields
* constructs the `PageServerConfig`
The rules are:
* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* a good example is `virtual_file_io_engine` or `l0_flush`, both of
which need to execute code to determine the effective value in
`PageServerConf`
The benefits:
* massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field
Drawbacks:
* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss
Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / coompile-time errors than the current
approach.
### Refs
Fixes#3682
### Future Work
* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
* break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
* use `pub(crate)` visibility on `mod defaults` to detect stale values
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).
# Problem
`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.
# Background
Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.
The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offset, we find a varint length followed by the serialized
`Value`.
Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.
The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.
The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.
# Changes
1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` figures out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for thes values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io`
5. We can remove the `ephemeral_file::page_caching` construct now.
The `get_values_reconstruct_data` changes seem like a bit overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.
The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.
# Alternative Design
The original plan was to use the `vectored_blob_io` code it relies on
the invariant of Delta&Image layers that `index order == values order`.
Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719
# Testing / Performance
Correctness of the IO merging code is ensured by unit tests.
Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.
Performance testing results are presented below.
All pref testing done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.
We see definitive improvement in ingest performance microbenchmark and
an ad-hoc microbenchmark for getpage against InMemoryLayer.
```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```
<details>
```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'
baseline
ingest-small-values/ingest 128MB/100b seq, no delta
time: [483.50 ms 498.73 ms 522.53 ms]
thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]
HEAD
ingest-small-values/ingest 128MB/100b seq, no delta
time: [479.22 ms 482.92 ms 487.35 ms]
thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```
</details>
We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.
<details>
```
./target/release/neon_local stop
rm -rf .neon
./target/release/neon_local init
./target/release/neon_local start
./target/release/neon_local tenant create --set-default
./target/release/neon_local endpoint create foo
./target/release/neon_local endpoint start foo
psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)
CREATE TABLE wal_test (
id SERIAL PRIMARY KEY,
data TEXT
);
DO $$
DECLARE
i INTEGER := 1;
BEGIN
WHILE i <= 500000 LOOP
INSERT INTO wal_test (data) VALUES ('data');
i := i + 1;
END LOOP;
END $$;
-- => result is one L0 from initdb and one 137M-sized ephemeral-2
DO $$
DECLARE
i INTEGER := 1;
random_id INTEGER;
random_record wal_test%ROWTYPE;
start_time TIMESTAMP := clock_timestamp();
selects_completed INTEGER := 0;
min_id INTEGER := 1; -- Minimum ID value
max_id INTEGER := 100000; -- Maximum ID value, based on your insert range
iters INTEGER := 100000000; -- Number of iterations to run
BEGIN
WHILE i <= iters LOOP
-- Generate a random ID within the known range
random_id := min_id + floor(random() * (max_id - min_id + 1))::int;
-- Select the row with the generated random ID
SELECT * INTO random_record
FROM wal_test
WHERE id = random_id;
-- Increment the select counter
selects_completed := selects_completed + 1;
-- Check if a second has passed
IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
-- Print the number of selects completed in the last second
RAISE NOTICE 'Selects completed in last second: %', selects_completed;
-- Reset counters for the next second
selects_completed := 0;
start_time := clock_timestamp();
END IF;
-- Increment the loop counter
i := i + 1;
END LOOP;
END $$;
./target/release/neon_local stop
baseline: commit 7c74112b2a origin/main
NOTICE: Selects completed in last second: 1864
NOTICE: Selects completed in last second: 1850
NOTICE: Selects completed in last second: 1851
NOTICE: Selects completed in last second: 1918
NOTICE: Selects completed in last second: 1911
NOTICE: Selects completed in last second: 1879
NOTICE: Selects completed in last second: 1858
NOTICE: Selects completed in last second: 1827
NOTICE: Selects completed in last second: 1933
ours
NOTICE: Selects completed in last second: 1915
NOTICE: Selects completed in last second: 1928
NOTICE: Selects completed in last second: 1913
NOTICE: Selects completed in last second: 1932
NOTICE: Selects completed in last second: 1846
NOTICE: Selects completed in last second: 1955
NOTICE: Selects completed in last second: 1991
NOTICE: Selects completed in last second: 1973
```
NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.
</details>
# Rollout
This PR changes the code in-place and is not gated by a feature flag.
The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.
The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`
Context
-------
I did this work in preparation for
https://github.com/neondatabase/neon/pull/8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
## Problem
Some developers build on MacOS, which doesn't have io_uring.
## Summary of changes
- Add `io_engine_for_bench`, which on linux will give io_uring or panic
if it's unavailable, and on MacOS will always panic.
We do not want to run such benchmarks with StdFs: the results aren't
interesting, and will actively waste the time of any developers who
start investigating performance before they realize they're using a
known-slow I/O backend.
Why not just conditionally compile this benchmark on linux only? Because
even on linux, I still want it to refuse to run if it can't get
io_uring.
## Problem
new clippy warnings on nightly.
## Summary of changes
broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_defintions`
part of https://github.com/neondatabase/neon/issues/7418
I reviewed how the VirtualFile API's `read` methods look like and came
to the conclusion that we've been using `IoBufMut` / `BoundedBufMut` /
`Slice` wrong.
This patch rectifies the situation.
# Change 1: take `tokio_epoll_uring::Slice` in the read APIs
Before, we took an `IoBufMut`, which is too low of a primitive and while
it _seems_ convenient to be able to pass in a `Vec<u8>` without any
fuzz, it's actually very unclear at the callsite that we're going to
fill up that `Vec` up to its `capacity()`, because that's what
`IoBuf::bytes_total()` returns and that's what
`VirtualFile::read_exact_at` fills.
By passing a `Slice` instead, a caller that "just wants to read into a
`Vec`" is forced to be explicit about it, adding either `slice_full()`
or `slice(x..y)`, and these methods panic if the read is outside of the
bounds of the `Vec::capacity()`.
Last, passing slices is more similar to what the `std::io` APIs look
like.
# Change 2: fix UB in `virtual_file_io_engine=std-fs`
While reviewing call sites, I noticed that the
`io_engine::IoEngine::read_at` method for `StdFs` mode has been
constructing an `&mut[u8]` from raw parts that were uninitialized.
We then used `std::fs::File::read_exact` to initialize that memory, but,
IIUC we must not even be constructing an `&mut[u8]` where some of the
memory isn't initialized.
So, stop doing that and add a helper ext trait on `Slice` to do the
zero-initialization.
# Change 3: eliminate `read_exact_at_n`
The `read_exact_at_n` doesn't make sense because the caller can just
1. `slice = buf.slice()` the exact memory it wants to fill
2. `slice = read_exact_at(slice)`
3. `buf = slice.into_inner()`
Again, the `std::io` APIs specify the length of the read via the Rust
slice length.
We should do the same for the owned buffers IO APIs, i.e., via
`Slice::bytes_total()`.
# Change 4: simplify filling of `PageWriteGuard`
The `PageWriteGuardBuf::init_up_to` was never necessary.
Remove it. See changes to doc comment for more details.
---
Reviewers should probably look at the added test case first, it
illustrates my case a bit.
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
Release notes: https://blog.rust-lang.org/2024/05/02/Rust-1.78.0.html
Prior update was in #7198
part of https://github.com/neondatabase/neon/issues/7124
Changes
-------
This PR replaces the `EphemeralFile::write_blob`-specifc `struct Writer`
with re-use of `owned_buffers_io::write::BufferedWriter`.
Further, it restructures the code to cleanly separate
* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
* performing buffered write IO to an underlying VirtualFile
* serving reads from either the VirtualFile or the buffer if it hasn't
been flushed yet
* the annoying "feature" that reads past the end of the written range
are allowed and expected to return zeroed memory, as long as one remains
within one PAGE_SZ
Manual testing of the changes in #7160 revealed that, if the
thread-local destructor ever runs (it apparently doesn't in our test
suite runs, otherwise #7160 would not have auto-merged), we can
encounter an `abort()` due to a double-panic in the tracing code.
This github comment here contains the stack trace:
https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176
This PR reverts #7160 and uses a atomic counter to identify the
thread-local in log messages, instead of the memory address of the
thread local, which may be re-used.
The PR #7141 added log message
```
ThreadLocalState is being dropped and id might be re-used in the future
```
which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
ie.., on each tokio-epoll-uring operation.
Testing
-------
Reproduced the issue locally and verified that this PR fixes the issue.
refs https://github.com/neondatabase/neon/issues/7136
Problem
-------
Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure
As we've learned in [the
past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.
And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per
commit 3da410c8fe
Author: Christian Schwarz <christian@neon.tech>
Date: Tue Mar 5 10:03:54 2024 +0100
tokio-epoll-uring: use it on the layer-creating code paths (#6378)
Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.
Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.
Solution
--------
This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert
This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.
# Problem
On-demand downloads are still using `tokio::fs`, which we know is
inefficient.
# Changes
- Add `pagebench ondemand-download-churn` to quantify on-demand download
throughput
- Requires dumping layer map, which required making `history_buffer`
impl `Deserialize`
- Implement an equivalent of `tokio::io::copy_buf` for owned buffers =>
`owned_buffers_io` module and children.
- Make layer file download sensitive to `io_engine::get()`, using
VirtualFile + above copy loop
- For this, I had to move some code into the `retry_download`, e.g.,
`sync_all()` call.
Drive-by:
- fix missing escaping in `scripts/ps_ec2_setup_instance_store`
- if we failed in retry_download to create a file, we'd try to remove
it, encounter `NotFound`, and `abort()` the process using
`on_fatal_io_error`. This PR adds treats `NotFound` as a success.
# Testing
Functional
- The copy loop is generic & unit tested.
Performance
- Used the `ondemand-download-churn` benchmark to manually test against
real S3.
- Results (public Notion page):
https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4
- Performance is equivalent at low concurrency. Jumpier situation at
high concurrency, but, still less CPU / throughput with
tokio-epoll-uring.
- It’s a win.
# Future Work
Turn the manual performance testing described in the above results
document into a performance regression test:
https://github.com/neondatabase/neon/issues/7146
fixes https://github.com/neondatabase/neon/issues/7116
Changes:
- refactor PageServerConfigBuilder: support not-set values
- implement runtime feature test
- use runtime feature test to determine `virtual_file_io_engine` if not
explicitly configured in the config
- log the effective engine at startup
- drive-by: improve assertion messages in `test_pageserver_init_node_id`
This needed a tiny bit of tokio-epoll-uring work, hence bumping it.
Changelog:
```
git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939
c7a74c6 Bump mio from 0.8.8 to 0.8.11
4df3466 Bump mio from 0.8.8 to 0.8.11 (#47)
342ddd1 lifecycle: expose `LaunchResult` enum (#49)
```
Before this PR, the layer file download code would fsync the inode after
rename instead of the timeline directory. That is not in line with what
a comment further up says we're doing, and it's obviously not achieving
the goal of making the rename durable.
part of https://github.com/neondatabase/neon/issues/6663
part of #6663
See that epic for more context & related commits.
Problem
-------
Before this PR, the layer-file-creating code paths were using
VirtualFile, but under the hood these were still blocking system calls.
Generally this meant we'd stall the executor thread, unless the caller
"knew" and used the following pattern instead:
```
spawn_blocking(|| {
Handle::block_on(async {
VirtualFile::....().await;
})
}).await
```
Solution
--------
This PR adopts `tokio-epoll-uring` on the layer-file-creating code paths
in pageserver.
Note that on-demand downloads still use `tokio::fs`, these will be
converted in a future PR.
Design: Avoiding Regressions With `std-fs`
------------------------------------------
If we make the VirtualFile write path truly async using
`tokio-epoll-uring`, should we then remove the `spawn_blocking` +
`Handle::block_on` usage upstack in the same commit?
No, because if we’re still using the `std-fs` io engine, we’d then block
the executor in those places where previously we were protecting us from
that through the `spawn_blocking` .
So, if we want to see benefits from `tokio-epoll-uring` on the write
path while also preserving the ability to switch between
`tokio-epoll-uring` and `std-fs` , where `std-fs` will behave identical
to what we have now, we need to ***conditionally* use `spawn_blocking +
Handle::block_on`** .
I.e., in the places where we use that know, we’ll need to make that
conditional based on the currently configured io engine.
It boils down to investigating all the places where we do
`spawn_blocking(... block_on(... VirtualFile::...))`.
Detailed [write-up of that investigation in
Notion](https://neondatabase.notion.site/Surveying-VirtualFile-write-path-usage-wrt-tokio-epoll-uring-integration-spawn_blocking-Handle-bl-5dc2270dbb764db7b2e60803f375e015?pvs=4
), made publicly accessible.
tl;dr: Preceding PRs addressed the relevant call sites:
- `metadata` file: turns out we could simply remove it (#6777, #6769,
#6775)
- `create_delta_layer()`: made sensitive to `virtual_file_io_engine` in
#6986
NB: once we are switched over to `tokio-epoll-uring` everywhere in
production, we can deprecate `std-fs`; to keep macOS support, we can use
`tokio::fs` instead. That will remove this whole headache.
Code Changes In This PR
-----------------------
- VirtualFile API changes
- `VirtualFile::write_at`
- implement an `ioengine` operation and switch `VirtualFile::write_at`
to it
- `VirtualFile::metadata()`
- curiously, we only use it from the layer writers' `finish()` methods
- introduce a wrapper `Metadata` enum because `std::fs::Metadata` cannot
be constructed by code outside rust std
- `VirtualFile::sync_all()` and for completeness sake, add
`VirtualFile::sync_data()`
Testing & Rollout
-----------------
Before merging this PR, we ran the CI with both io engines.
Additionally, the changes will soak in staging.
We could have a feature gate / add a new io engine
`tokio-epoll-uring-write-path` to do a gradual rollout. However, that's
not part of this PR.
Future Work
-----------
There's still some use of `std::fs` and/or `tokio::fs` for directory
namespace operations, e.g. `std::fs::rename`.
We're not addressing those in this PR, as we'll need to add the support
in tokio-epoll-uring first. Note that rename itself is usually fast if
the directory is in the kernel dentry cache, and only the fsync after
rename is slow. These fsyncs are using tokio-epoll-uring, so, the impact
should be small.
This PR adds an API to live-reconfigure the VirtualFile io engine.
It also adds a flag to `pagebench get-page-latest-lsn`, which is where I
found this functionality to be useful: it helps compare the io engines
in a benchmark without re-compiling a release build, which took ~50s on
the i3en.3xlarge where I was doing the benchmark.
Switching the IO engine is completely safe at runtime.