# Problem
Before this PR, `test_pageserver_catchup_while_compute_down` would
occasionally fail due to scary-looking WARN log line
```
WARN ephemeral_file_buffered_writer{...}:flush_attempt{attempt=1}: \
error flushing buffered writer buffer to disk, retrying after backoff err=Operation canceled (os error 125)
```
After lengthy investigation, the conclusion is that this is likely due
to a kernel bug related due to io_uring async workers (io-wq) and
signals.
The main indicator is that the error only ever happens in correlation
with pageserver shtudown when SIGTERM is received.
There is a fix that is merged in 6.14
kernels (`io-wq: backoff when retrying worker creation`).
However, even when I revert that patch, the issue is not reproducible
on 6.14, so, it remains a speculation.
It was ruled out that the ECANCELED is due to the executor thread
exiting before the async worker starts processing the operation.
# Solution
The workaround in this issue is to retry the operation on ECANCELED
once.
Retries are safe because the low-level io_engine operations are
idempotent.
(We don't use O_APPEND and I can't think of another flag that would make
the APIs covered by this patch not idempotent.)
# Testing
With this PR, the warn! log no longer happens on [my reproducer
setup](https://github.com/neondatabase/neon/issues/11446#issuecomment-2843015111).
And the new rate-limited `info!`-level log line informing about the
internal retry shows up instead, as expected.
# Refs
- fixes https://github.com/neondatabase/neon/issues/11446
# Improve OpenOptions API ergonomics
Closes#11787
This PR improves the OpenOptions API ergonomics by:
1. Making OpenOptions methods take and return owned Self instead of &mut
self
2. Changing VirtualFile::open_with_options_v2 to take an owned
OpenOptions
3. Removing unnecessary .clone() and .to_owned() calls
These changes make the API more idiomatic Rust by leveraging the builder
pattern with owned values, which is cleaner and more ergonomic than the
previous approach.
Link to Devin run:
https://app.devin.ai/sessions/c2a4b24f7aca40a3b3777f4259bf8ee1
Requested by: christian@neon.tech
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: christian@neon.tech <christian@neon.tech>
# Problem
The Pageserver read path exclusively uses direct IO if
`virtual_file_io_mode=direct`.
The write path is half-finished. Here is what the various writing
components use:
|what|buffering|flags on <br/>`v_f_io_mode`<br/>=`buffered`|flags on
<br/>`virtual_file_io_mode`<br/>=`direct`|
|-|-|-|-|
|`DeltaLayerWriter`| BlobWriter<BUFFERED=true> | () | () |
|`ImageLayerWriter`| BlobWriter<BUFFERED=false> | () | () |
|`download_layer_file`|BufferedWriter|()|()|
|`InMemoryLayer`|BufferedWriter|()|O_DIRECT|
The vehicle towards direct IO support is `BufferedWriter` which
- largely takes care of O_DIRECT alignment & size-multiple requirements
- double-buffering to mask latency
`DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which
has neither of these.
# Changes
## High-Level
At a high-level this PR makes the following primary changes:
- switch the two layer writer types to use `BufferedWriter` & make
sensitive to `virtual_file_io_mode` (via open_with_options_**v2**)
- make `download_layer_file` sensitive to `virtual_file_io_mode` (also
via open_with_options_**v2**)
- add `virtual_file_io_mode=direct-rw` as a feature gate
- we're hackish-ly piggybacking on OpenOptions's ask for write access
here
- this means with just `=direct` InMemoryLayer reads and writes no
longer uses O_DIRECT
- this is transitory and we'll remove the `direct-rw` variant once the
rollout is complete
(The `_v2` APIs for opening / creating VirtualFile are those that are
sensitive to `virtual_file_io_mode`)
The result is:
|what|uses <br/>`BufferedWriter`|flags on
<br/>`v_f_io_mode`<br/>=`buffered`|flags on
<br/>`v_f_io_mode`<br/>=`direct`|flags on
<br/>`v_f_io_mode`<br/>=`direct-rw`|
|-|-|-|-|-|
|`DeltaLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`ImageLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`download_layer_file`|BufferedWriter|()|()|O_DIRECT|
|`InMemoryLayer`|BufferedWriter|()|~~O_DIRECT~~()|O_DIRECT|
## Code-Level
The main change is:
- Switch `blob_io::BlobWriter` away from its own buffering method to use
`BufferedWriter`.
Additional prep for upholding `O_DIRECT` requirements:
- Layer writer `finish()` methods switched to use IoBufferMut for
guaranteed buffer address alignment. The size of the buffers is PAGE_SZ
and thereby implicitly assumed to fulfill O_DIRECT requirements.
For the hacky feature-gating via `=direct-rw`:
- Track `OpenOptions::write(true|false)` in a field; bunch of mechanical
churn.
- Consolidate the APIs in which we "open" or "create" VirtualFile for
better overview over which parts of the code use the `_v2` APIs.
Necessary refactorings & infra work:
- Add doc comments explaining how BufferedWriter ensures that writes are
compliant with O_DIRECT alignment & size constraints. This isn't new,
but should be spelled out.
- Add the concept of shutdown modes to `BufferedWriter::shutdown` to
make writer shutdown adhere to these constraints.
- The `PadThenTruncate` mode might not be necessary in practice because
I believe all layer files ever written are sized in multiples `PAGE_SZ`
and since `PAGE_SZ` is larger than the current alignment requirements
(512/4k depending on platform), it won't be necesary to pad.
- Some test (I believe `round_trip_test_compressed`?) required it though
- [ ] TODO: decide if we want to accept that complexity; if we do then
address TODO in the code to separate alignment requirement from buffer
capacity
- Add `set_len` (=`ftruncate`) VirtualFile operation to support the
above.
- Allow `BufferedWriter` to start at a non-zero offset (to make room for
the summary block).
Cleanups unlocked by this change:
- Remove non-positional APIs from VirtualFile (e.g. seek, write_full,
read_full)
Drive-by fixes:
- PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit
tests for all `virtual_file_io_mode` combinations but didn't because of
a missing `_` in the env var.
# Performance
This section assesses this PR's impact on deployments with current
production setting (`=direct`) and anticipated impact of switching to
(`=direct-rw`).
For `DeltaLayerWriter`, `=direct` should remain unchanged to slightly
improved on throughput because the `BlobWriter`'s buffer had the same
size as the `BufferedWriter`'s buffer, but it didn't have the
double-buffering that `BufferedWriter` has.
The `=direct-rw` enables direct IO; throughput should not be suffering
because of double-buffering; benchmarks will show if this is true.
The `ImageLayerWriter` was previously not doing any buffering
(`BUFFERED=false`).
It went straight to issuing the IO operation to the underlying
VirtualFile and the buffering was done by the kernel.
The switch to `BufferedWriter` under `=direct` adds an additional memcpy
into the BufferedWriter's buffer.
We will win back that memcpy when enabling direct IO via `=direct-rw`.
A nice win from the switch to `BufferedWriter` is that ImageLayerWriter
performs >=16x fewer write operations to VirtualFile (the BlobWriter
performs one write per len field and one write per image value).
This should save low tens of microseconds of CPU overhead from doing all
these syscalls/io_uring operations, regardless of `=direct` or
`=direct-rw`.
Aside from problems with alignment, this write frequency without
double-buffering is prohibitive if we actually have to wait for the
disk, which is what will happen when we enable direct IO via
(`=direct-rw`).
Throughput should not be suffering because of BufferedWrite's
double-buffering; benchmarks will show if this is true.
`InMemoryLayer` at `=direct` will flip back to using buffered IO but
remain on BufferedWriter.
The buffered IO adds back one memcpy of CPU overhead.
Throughput should not suffer and will might improve on
not-memory-pressured Pageservers but let's remember that we're doing the
whole direct IO thing to eliminate global memory pressure as a source of
perf variability.
## bench_ingest
I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`.
Use `git diff` with `--word-diff` or similar to see the change.
General guidance on interpretation:
- immediate production impact of this PR without production config
change can be gauged by comparing the same `io_mode=Direct`
- end state of production switched over to `io_mode=DirectRw` can be
gauged by comparing old results' `io_mode=Direct` to new results'
`io_mode=DirectRw`
Given above guidance, on `im4gn.2xlarge`
- immediate impact is a significant improvement in all cases
- end state after switching has same significant improvements in all
cases
- ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192
key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s`
instead of `253.43 MiB/s`
- this is a 6% degradation
- this workload is typical for image layer creation
# Refs
- epic https://github.com/neondatabase/neon/issues/9868
- stacked atop
- preliminary refactor https://github.com/neondatabase/neon/pull/11549
- bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667
- derived from https://github.com/neondatabase/neon/pull/10063
Co-authored-by: Yuchen Liang <yuchen@neon.tech>
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
refs https://github.com/neondatabase/neon/issues/7136
Problem
-------
Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure
As we've learned in [the
past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.
And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per
commit 3da410c8fe
Author: Christian Schwarz <christian@neon.tech>
Date: Tue Mar 5 10:03:54 2024 +0100
tokio-epoll-uring: use it on the layer-creating code paths (#6378)
Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.
Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.
Solution
--------
This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert
This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.
This PR adds an API to live-reconfigure the VirtualFile io engine.
It also adds a flag to `pagebench get-page-latest-lsn`, which is where I
found this functionality to be useful: it helps compare the io engines
in a benchmark without re-compiling a release build, which took ~50s on
the i3en.3xlarge where I was doing the benchmark.
Switching the IO engine is completely safe at runtime.