Earlier work (#7547) has made the scrubber internally generic, but one
could only configure it to use S3 storage.
This is the final piece to make (most of, snapshotting still requires
S3) the scrubber be able to be configured via GenericRemoteStorage.
I.e. you can now set an env var like:
```
REMOTE_STORAGE_CONFIG='remote_storage = { bucket_name = "neon-dev-safekeeper-us-east-2d", bucket_region = "us-east-2" }
```
and the scrubber will read it instead.
## Problem
`no_sync` initially just skipped syncfs on startup (#9677). I'm also
interested in flaky tests that time out during pageserver shutdown while
flushing l0s, so to eliminate disk throughput as a source of issues
there,
## Summary of changes
- Drive-by change for test timeouts: add a couple more ::info logs
during pageserver startup so it's obvious which part got stuck.
- Add a SyncMode enum to configure VirtualFile and respect it in
sync_all and sync_data functions
- During pageserver startup, set SyncMode according to `no_sync`
Part of #8130
## Problem
Pageserver previously goes through the kernel page cache for all the
IOs. The kernel page cache makes light-loaded pageserver have deceptive
fast performance. Using direct IO would offer predictable latencies of
our virtual file IO operations.
In particular for reads, the data pages also have an extremely low
temporal locality because the most frequently accessed pages are cached
on the compute side.
## Summary of changes
This PR enables pageserver to use direct IO for delta layer and image
layer reads. We can ship them separately because these layers are
write-once, read-many, so we will not be mixing buffered IO with direct
IO.
- implement `IoBufferMut`, an buffer type with aligned allocation
(currently set to 512).
- use `IoBufferMut` at all places we are doing reads on image + delta
layers.
- leverage Rust type system and use `IoBufAlignedMut` marker trait to
guarantee that the input buffers for the IO operations are aligned.
- page cache allocation is also made aligned.
_* in-memory layer reads and the write path will be shipped separately._
## Testing
Integration test suite run with O_DIRECT enabled:
https://github.com/neondatabase/neon/pull/9350
## Performance
We evaluated performance based on the `get-page-at-latest-lsn`
benchmark. The results demonstrate a decrease in the number of IOps, no
sigificant change in the latency mean, and an slight improvement on the
p99.9 and p99.99 latencies.
[Benchmark](https://www.notion.so/neondatabase/Benchmark-O_DIRECT-for-image-and-delta-layers-2024-10-01-112f189e00478092a195ea5a0137e706?pvs=4)
## Rollout
We will add `virtual_file_io_mode=direct` region by region to enable
direct IO on image + delta layers.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
We need a way to incrementally switch to direct IO. During the rollout
we might want to switch to O_DIRECT on image and delta layer read path
first before others.
## Summary of changes
- Revisited and simplified direct io config in `PageserverConf`.
- We could add a fallback mode for open, but for read there isn't a
reasonable alternative (without creating another buffered virtual file).
- Added a wrapper around `VirtualFile`, current implementation become
`VirtualFileInner`
- Use `open_v2`, `create_v2`, `open_with_options_v2` when we want to use
the IO mode specified in PS config.
- Once we onboard all IO through VirtualFile using this new API, we will
delete the old code path.
- Make io mode live configurable for benchmarking.
- Only guaranteed for files opened after the config change, so do it
before the experiment.
As an example, we are using `open_v2` with
`virtual_file::IoMode::Direct` in
https://github.com/neondatabase/neon/pull/9169
We also remove `io_buffer_alignment` config in
a04cfd754b and use it as a compile time
constant. This way we don't have to carry the alignment around or make
frequent call to retrieve this information from the static variable.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
This PR simplifies the pageserver configuration parsing as follows:
* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
* use `serde(default)` to fill in default values
* in `pageserver` crate:
* use `toml_edit` to deserialize the pageserver.toml string into a
`ConfigToml`
* `PageServerConfig::parse_and_validate` then
* consumes the `ConfigToml`
* destructures it exhaustively into its constituent fields
* constructs the `PageServerConfig`
The rules are:
* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* a good example is `virtual_file_io_engine` or `l0_flush`, both of
which need to execute code to determine the effective value in
`PageServerConf`
The benefits:
* massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field
Drawbacks:
* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss
Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / coompile-time errors than the current
approach.
### Refs
Fixes#3682
### Future Work
* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
* break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
* use `pub(crate)` visibility on `mod defaults` to detect stale values
Part of #8130, closes#8719.
## Problem
Currently, vectored blob io only coalesce blocks if they are immediately
adjacent to each other. When we switch to Direct IO, we need a way to
coalesce blobs that are within the dio-aligned boundary but has gap
between them.
## Summary of changes
- Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and
`StreamingVectoredReadPlanner` which has two modes:
- `AdjacentOnly` (current implementation)
- `Chunked(<alignment requirement>)`
- New `ChunkedVectorBuilder` that considers batching `dio-align`-sized
read, the start and end of the vectored read will respect
`stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and
`vectored_read.blobs_at.first().start_offset` will be two different
value).
- Since we break the assumption that blobs within single `VectoredRead`
are next to each other (implicit end offset), we start to store blob end
offsets in the `VectoredRead`.
- Adapted existing tests to run in both `VectoredReadCoalesceMode`.
- The io alignment can also be live configured at runtime.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
PR #8299 has switched the storage scrubber to use
`DefaultCredentialsChain`. Now we do this for `remote_storage`, as it
allows us to use `remote_storage` from inside kubernetes. Most of the
diff is due to `GenericRemoteStorage::from_config` becoming `async fn`.
Before this PR, `RemoteStorageConfig::from_toml` would support
deserializing an
empty `{}` TOML inline table to a `None`, otherwise try `Some()`.
We can instead let
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
RemtoeStorageConfig
(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
Part of https://github.com/neondatabase/neon/issues/7462
## Summary of changes
Tenant config is not persisted unless it's attached on the storage
controller. In this pull request, we persist the aux file policy flag in
the `index_part.json`.
Admins can set `switch_aux_file_policy` in the storage controller or
using the page server API. Upon the first aux file gets written, the
write path will compare the aux file policy target with the current
policy. If it is switch-able, we will do the switch. Otherwise, the
original policy will be used. The test cases show what the admins can do
/ cannot do.
The `last_aux_file_policy` is stored in `IndexPart`. Updates to the
persisted policy are done via
`schedule_index_upload_for_aux_file_policy_update`. On the write path,
the writer will update the field.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Adds another tool to the DR toolbox: ability in pagectl to
recover arbitrary prefixes in remote storage. Requires remote storage config,
the prefix, and the travel-to timestamp parameter
to be specified as cli args.
The done-if-after parameter is also supported.
Example invocation (after `aws login --profile dev`):
```
RUST_LOG=remote_storage=debug AWS_PROFILE=dev cargo run -p pagectl time-travel-remote-prefix 'remote_storage = { bucket_name = "neon-test-bucket-name", bucket_region = "us-east-2" }' wal/3aa8fcc61f6d357410b7de754b1d9001/641e5342083b2235ee3deb8066819683/ 2024-04-05T17:00:00Z
```
This has been written to resolve a customer recovery case:
https://neondb.slack.com/archives/C033RQ5SPDH/p1712256888468009
There is validation of the prefix to prevent accidentially specifying
too generic prefixes, which can cause corruption and data
loss if used wrongly. Still, the validation is not perfect and it is
important that the command is used with caution.
If possible, `time_travel_remote_storage` should
be used instead which has additional checks in place.
This PR adds an API to live-reconfigure the VirtualFile io engine.
It also adds a flag to `pagebench get-page-latest-lsn`, which is where I
found this functionality to be useful: it helps compare the io engines
in a benchmark without re-compiling a release build, which took ~50s on
the i3en.3xlarge where I was doing the benchmark.
Switching the IO engine is completely safe at runtime.
fixes https://github.com/neondatabase/neon/issues/5878
obsoletes https://github.com/neondatabase/neon/issues/5879
Before this PR, it could happen that `load_layer_map` schedules removal
of the future
image layer. Then a later compaction run could re-create the same image
layer, scheduling a PUT.
Due to lack of an upload queue barrier, the PUT and DELETE could be
re-ordered.
The result was IndexPart referencing a non-existent object.
## Summary of changes
* Add support to `pagectl` / Python tests to decode `IndexPart`
* Rust
* new `pagectl` Subcommand
* `IndexPart::{from,to}_s3_bytes()` methods to internalize knowledge
about encoding of `IndexPart`
* Python
* new `NeonCli` subclass
* Add regression test
* Rust
* Ability to force repartitioning; required to ensure image layer
creation at last_record_lsn
* Python
* The regression test.
* Fix the issue
* Insert an `UploadOp::Barrier` after scheduling the deletions.
Fixes#4689 by replacing all of `std::Path` , `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
In #4743, we'd like to convert the read path to use `async` rust. In
preparation of that, this PR switches some functions that are calling
lower level functions like `BlockReader::read_blk`,
`BlockCursor::read_blob`, etc into `async`. The PR does not switch all
functions however, and only focuses on the ones which are easy to
switch.
This leaves around some async functions that are (currently)
unnecessarily `async`, but on the other hand it makes future changes
smaller in diff.
Part of #4743 (but does not completely address it).
This PR refactors the original page_binutils with a single tool pagectl,
use clap derive for better command line parsing, and adds the dump kv
tool to extract information from delta file. This helps me better
understand what's inside the page server. We can add support for other
types of file and more functionalities in the future.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>