## TLDR
This PR is a no-op. The changes are disabled by default.
## Problem
I. Currently we don't have a way to detect disk I/O failures from WAL
operations.
II.
We observed that the offloader fails to upload a segment due to a race
condition between XLOG SWITCH and PG starting to stream WALs. The
wal_backup task continuously fails to upload a full segment while the
segment remains partial on disk.
The consequence is that commit_lsn for all SKs moves forward but
backup_lsn stays the same. Eventually, all SKs run out of disk space.
III.
We have discovered SK bugs where the WAL offload owner cannot keep up
with WAL backup/upload to S3, which results in an unbounded accumulation
of WAL segment files on the Safekeeper's disk until the disk becomes
full. This is a dangerous situation that is hard to recover from because
the Safekeeper cannot write its control files when it is out of disk
space. There are actually two problems here:
1. A single problematic timeline can take over the entire disk of the
SK.
2. Once out of disk space, it is difficult to recover the SK.
IV.
Neon reports certain storage errors as "critical" errors using a macro,
which increments a counter/metric that can be used to raise alerts.
However, this metric isn't sliced by tenant and/or timeline today. We
need the tenant/timeline dimension to better respond to incidents and
for blast radius analysis.
## Summary of changes
I.
The PR adds a `safekeeper_wal_disk_io_errors` metric which is
incremented when the SK fails to create or flush WALs.
II.
To mitigate this issue, we re-elect a new offloader if the current
offloader is lagging behind too much.
Each SK makes the decision locally, but they are aware of each other's
commit and backup LSNs.
The new algorithm is (see the sketch below):
- determine_offloader picks a SK, say SK-1.
- Each SK checks:
  - if commit_lsn - backup_lsn > threshold,
    - remove SK-1 from the candidates and call determine_offloader again.

SK-1 will step down and all SKs will elect the same new leader.
After the backup has caught up, the leader will become SK-1 again.
This also helps when SK-1 is slow to back up.
I'll set the re-election backup lag to 4 GB later. It is set to 128 MB
in dev to trigger the code more frequently.
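A minimal sketch of the check each SK runs locally, under stated assumptions: `determine_offloader` is named in this PR, while `PeerInfo`, `elect_offloader`, and the constant here are hypothetical stand-ins.

```rust
/// Hypothetical sketch of the local re-election check described above.
const REELECT_BACKUP_LAG: u64 = 128 * 1024 * 1024; // 128 MB in dev; 4 GB planned

#[derive(Clone)]
struct PeerInfo {
    sk_id: u64,
    commit_lsn: u64,
    backup_lsn: u64,
}

// Deterministic choice so every SK arrives at the same answer, e.g. lowest id.
fn determine_offloader(candidates: &[PeerInfo]) -> Option<u64> {
    candidates.iter().map(|p| p.sk_id).min()
}

fn elect_offloader(peers: &[PeerInfo]) -> Option<u64> {
    let leader = determine_offloader(peers)?;
    let lagging = peers
        .iter()
        .find(|p| p.sk_id == leader)
        .map(|p| p.commit_lsn.saturating_sub(p.backup_lsn) > REELECT_BACKUP_LAG)
        .unwrap_or(false);
    if lagging {
        // Drop the lagging leader from the candidates and re-run the same
        // deterministic election; once its backup catches up it wins again.
        let rest: Vec<PeerInfo> =
            peers.iter().filter(|p| p.sk_id != leader).cloned().collect();
        determine_offloader(&rest)
    } else {
        Some(leader)
    }
}
```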
III.
This change addresses problem no. 1 by having the Safekeeper perform a
timeline disk utilization check when processing WAL proposal messages
from Postgres/compute. The Safekeeper now rejects the WAL proposal
message, effectively stopping the writing of more WAL for the timeline
to disk, if the existing WAL files for the timeline on the SK disk
exceed a certain size (the default threshold is 100GB). The disk
utilization is calculated based on a `last_removed_segno` variable
tracked by the background task removing WAL files, which produces an
accurate and conservative estimate (>= actual disk usage) of the actual
disk usage (see the sketch below).
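A sketch of how such a conservative estimate can be derived; only `last_removed_segno` and the 100GB default come from the PR, the function names and LSN arithmetic are assumptions:

```rust
/// Hypothetical sketch: conservative estimate of a timeline's WAL footprint.
/// Segments up to `last_removed_segno` are known to be deleted; everything
/// between that and the segment containing `flush_lsn` is assumed on disk.
fn timeline_wal_disk_usage(flush_lsn: u64, last_removed_segno: u64, segment_size: u64) -> u64 {
    let flushed_segno = flush_lsn / segment_size;
    // >= actual usage: also counts segments that are only partially written.
    flushed_segno.saturating_sub(last_removed_segno) * segment_size
}

const MAX_TIMELINE_DISK_USAGE: u64 = 100 * 1024 * 1024 * 1024; // 100 GB default

fn should_reject_wal_proposal(flush_lsn: u64, last_removed_segno: u64, segment_size: u64) -> bool {
    timeline_wal_disk_usage(flush_lsn, last_removed_segno, segment_size) > MAX_TIMELINE_DISK_USAGE
}
```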
IV.
* Add a new metric `hadron_critical_storage_event_count` that has the
`tenant_shard_id` and `timeline_id` as dimensions.
* Modified the `critical!` macro to include tenant_id and timeline_id
as additional arguments and adapted existing call sites to populate the
tenant shard and timeline ID fields. The `critical!` macro invocation
now increments `hadron_critical_storage_event_count` with the extra
dimensions. (In the SK there is no notion of a tenant shard, so just the
tenant ID is recorded in lieu of the tenant shard ID.)
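As a rough sketch (the registration and macro body here are assumptions, not the actual implementation), the labeled counter and macro could look like this:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Hypothetical registration of the counter with the new dimensions.
static CRITICAL_STORAGE_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "hadron_critical_storage_event_count",
        "Critical storage errors, sliced by tenant shard and timeline",
        &["tenant_shard_id", "timeline_id"]
    )
    .unwrap()
});

// Hypothetical macro shape: callers now pass the tenant/timeline dimensions.
macro_rules! critical {
    ($tenant_shard_id:expr, $timeline_id:expr, $($fmt:tt)*) => {{
        CRITICAL_STORAGE_EVENTS
            .with_label_values(&[&$tenant_shard_id.to_string(), &$timeline_id.to_string()])
            .inc();
        tracing::error!($($fmt)*);
    }};
}
```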
I considered adding a separate macro to avoid merge conflicts, but I
think in this case (detecting critical errors) conflicts are probably
more desirable so that we can be aware whenever Neon adds another
`critical!` invocation in their code.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There are a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
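For illustration, the kind of rewrite this lint drives (a representative example, not a diff from this PR):

```rust
let name = "safekeeper";
// Before: warned by clippy::uninlined_format_args
println!("starting {}", name);
// After `cargo clippy --fix`: the variable is inlined into the format string
println!("starting {name}");
```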
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There are no functional changes, however
some code changes had to be made to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. I accidentally ran the formatter on
some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
Preparations for a successor of #10440:
* move `pull_timeline` to `safekeeper_api` and add it to
`SafekeeperClient`. We want to do `pull_timeline` for any creations that
we couldn't do initially.
* Add a `SafekeeperGeneration` type instead of relying on a type alias
(see the sketch below). We now want to maintain a safekeeper-specific
generation number in the storcon database. A separate type is important
to make it impossible to mix it up with the tenant's pageserver-specific
generation number; we absolutely want to avoid that for correctness
reasons. If someone mixes up a safekeeper and a pageserver id (both use
the `NodeId` type), that's bad, but at least no wrong generations are
flying around.
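A minimal sketch of the newtype pattern this relies on; the actual definition in `safekeeper_api` likely carries more derives and methods:

```rust
// Hypothetical sketch: a dedicated type instead of `type Generation = u32;`.
// The compiler now rejects passing a pageserver generation where a
// safekeeper generation is expected, even though both wrap the same integer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SafekeeperGeneration(u32);

impl SafekeeperGeneration {
    pub const fn new(n: u32) -> Self {
        Self(n)
    }
    pub fn next(self) -> Self {
        Self(self.0 + 1)
    }
}
```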
part of #9011
## Problem
https://github.com/neondatabase/neon/issues/9965
## Summary of changes
Add the safekeeper membership configuration struct itself and store it
in the control file. In passing, also add a creation timestamp to the
control file (there were cases where I wanted it in the past).
Remove the obsolete, unused PersistedPeerInfo struct from the control
file (it is still kept in control_file_upgrade.rs to have it in the old
upgrade code).
Remove the binary representation of the cfile in the roundtrip test.
Updating it is annoying, and we still test the actual roundtrip.
Also add the configuration to the timeline creation http request,
currently used only in one python test. In passing, slightly change the
meaning of the LSNs in the request: normally start_lsn is passed (the
same as ancestor_start_lsn in the similar pageserver call), but we allow
specifying a higher commit_lsn for manual intervention if needed. Also,
when an LSN is given, initialize term_history with it.
## Problem
The storage components take an entire `SafekeeperConf` during
construction, but only actually use the `no_sync` field. This makes it
hard to understand the storage inputs (which fields do they actually
care about?), and is also inconvenient for tests and benchmarks that
need to set up a lot of unnecessary boilerplate.
## Summary of changes
* Don't take the entire config, but pass in the `no_sync` field
explicitly.
* Take the timeline dir instead of `ttid` as an input, since it's the
only thing it cares about. (A sketch of this shape follows the list.)
* Fix a couple of tests to not leak tempdirs.
* Various minor tweaks.
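A sketch of the shape of this change, under assumed names (the real constructor lives in the safekeeper storage code):

```rust
use std::path::PathBuf;

// Hypothetical sketch of the new shape: instead of
//   fn open(conf: &SafekeeperConf, ttid: TenantTimelineId) -> PhysicalStorage
// the constructor takes only the inputs it actually uses.
struct PhysicalStorage {
    timeline_dir: PathBuf,
    no_sync: bool,
}

impl PhysicalStorage {
    fn open(timeline_dir: PathBuf, no_sync: bool) -> Self {
        Self { timeline_dir, no_sync }
    }
}
```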
Always do timeline init through an atomic rename of a temp directory
(see the sketch below). Add GlobalTimelines::load_temp_timeline which
does this, and use it from both pull_timeline and basic timeline
creation. This fixes a collection of issues:
- previously, timeline creation didn't really flush the cfile to disk
due to the 'nothing to do if state didn't change' check;
- even if it did, without the tmp dir it was possible to lose the cfile
but leave the timeline dir in place, making it look corrupted;
- the tenant directory creation fsync was missing in timeline creation;
- pull_timeline is now protected from concurrent runs of both itself
and timeline creation;
- the global timelines map entry now gets a special CreationInProgress
entry type which prevents anyone from getting access to the timeline
while it is being created (previously one could get access to it, but
it was locked during creation, which is valid but confusing if creation
failed).
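A minimal sketch of the tmp-dir-plus-rename pattern with the fsyncs needed to make it durable; the function name comes from the PR, the body is an assumption:

```rust
use std::{fs, io, path::Path};

// Hypothetical sketch of crash-safe timeline init: build everything in a temp
// directory, fsync it, then atomically rename it into place and fsync the
// parent so the rename itself is persisted.
fn load_temp_timeline(tmp_dir: &Path, timeline_dir: &Path) -> io::Result<()> {
    // fsync the temp dir so its contents (e.g. the control file) are durable
    // before the directory becomes visible under its final name.
    fs::File::open(tmp_dir)?.sync_all()?;
    fs::rename(tmp_dir, timeline_dir)?;
    // fsync the parent (tenant) directory to persist the rename.
    let parent = timeline_dir.parent().expect("timeline dir has a parent");
    fs::File::open(parent)?.sync_all()?;
    Ok(())
}
```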
fixes #8927
Removes additional async_trait usages from safekeeper and neon_local.
Also removes the now-redundant dependencies on the `async_trait` crate.
cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296
## Problem
The control file contains the id of the safekeeper that uploaded it.
Previously, when sending a snapshot of the control file to another sk,
the uploaded segment it references would eventually be gc-ed by the
receiving sk. This is incorrect because the original sk might still
need it later.
## Summary of Changes
When sending a snapshot and the control file contains an uploaded
segment:
* Create a copy of the segment in s3 with the destination sk in the
object name
* Tweak the streamed control file to point to the object created in the
previous step

Note that the snapshot endpoint now has to know the id of the requestor,
so the api has been extended to include the node id of the destination
sk.
Closes https://github.com/neondatabase/neon/issues/8542
Fixes https://github.com/neondatabase/neon/issues/6337
Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:
```
--enable-offload
Enable automatic switching to offloaded state
--delete-offloaded-wal
Delete local WAL files after offloading. When disabled, they will be left on disk
--control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
Pending updates to control file will be automatically saved after this interval [default: 300s]
```
The manager watches state updates and detects when there is no activity
on the timeline and an actual partial backup upload exists in remote
storage. When all conditions are met, the state can be switched to
offloaded.
In `timeline.rs` there is a `StateSK` enum to support switching between
states (sketched below). When offloaded, the code can access only the
control file structure and cannot use `SafeKeeper` to accept new WAL.
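A rough sketch of the shape of such an enum; the variant names and payloads here are assumptions:

```rust
// Hypothetical sketch; the real variants and payloads may differ.
struct SafeKeeper;              // stand-in for the full consensus state machine
struct TimelinePersistentState; // stand-in for the control file contents

enum StateSK {
    Loaded(SafeKeeper),                 // full access: can accept and flush new WAL
    Offloaded(TimelinePersistentState), // control file only: no new WAL accepted
}

impl StateSK {
    fn safekeeper(&mut self) -> Option<&mut SafeKeeper> {
        match self {
            StateSK::Loaded(sk) => Some(sk),
            StateSK::Offloaded(_) => None,
        }
    }
}
```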
`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains a guard to notify the manager about active tasks
requiring on-disk WAL access. All guards are issued by the manager, and
all requests are sent via a channel using `ManagerCtl`. When the manager
receives a request to issue a guard, it unevicts the timeline if it's
currently evicted.
Fixed a bug in partial WAL backup: it previously used `term` instead of
`last_log_term`.
After this commit is merged, next step is to roll this change out, as in
issue #6338.
This is a preparation for #8022, to make the PR both backwards and
forward compatible.
This commit adds an `eviction_state` field to the control file. It adds
support for reading it, but writes the control file in the old format
where possible, to keep the disk format forward compatible.
Note: in `patch_control_file`, the new field gets serialized to JSON
like this:
- `"eviction_state": "Present"`
- `"eviction_state": {"Offloaded": "0/8F"}`
- Add a /snapshot http endpoint streaming a tar archive of the timeline
contents up to flush_lsn.
- Add a check that the term doesn't change; the corresponding test
passes now.
- Also prepare infra to hold off WAL removal during the basebackup.
- Sprinkle fsyncs to persist the pull_timeline result.
ref https://github.com/neondatabase/neon/issues/6340
This is a preparation for
https://github.com/neondatabase/neon/issues/6337.
The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks, and WAL won't be deleted from disk while there is at least one
active FullAccessTimeline.
To get a FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading missing WAL files and waiting until the download finishes.
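A minimal sketch of this guard pattern, assuming a simple counter of active guards that eviction consults; only `full_access_guard` is named in the PR, the rest is hypothetical:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

// Hypothetical sketch: eviction is allowed only while no guard is alive.
struct Timeline {
    wal_users: Arc<AtomicUsize>,
}

struct FullAccessGuard {
    wal_users: Arc<AtomicUsize>,
}

impl Timeline {
    async fn full_access_guard(&self) -> anyhow::Result<FullAccessGuard> {
        // Once eviction lands, this is where missing WAL would be downloaded.
        self.wal_users.fetch_add(1, Ordering::SeqCst);
        Ok(FullAccessGuard { wal_users: self.wal_users.clone() })
    }

    fn can_evict(&self) -> bool {
        self.wal_users.load(Ordering::SeqCst) == 0
    }
}

impl Drop for FullAccessGuard {
    fn drop(&mut self) {
        self.wal_users.fetch_sub(1, Ordering::SeqCst);
    }
}
```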
This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned instantly after horizon is updated
Add support for backing up partial segments to remote storage. Disabled
by default, can be enabled with `--partial-backup-enabled`.
The safekeeper timeline has a background task which is subscribed to
`commit_lsn` and `flush_lsn` updates. After the partial segment is
updated (`flush_lsn` has changed), the segment will be uploaded to S3
within about 15 minutes.
The filename format for partial segments is
`Segment_Term_Flush_Commit_skNN.partial`, where:
- `Segment` – the segment name, like `000000010000000000000001`
- `Term` – current term
- `Flush` – flush_lsn in hex format `{:016X}`, e.g. `00000000346BC568`
- `Commit` – commit_lsn in the same hex format
- `NN` – safekeeper_id, like `1`
The full object name example:
`000000010000000000000002_2_0000000002534868_0000000002534410_sk1.partial`
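A small sketch that renders this naming scheme (illustrative only, not the actual implementation):

```rust
// Hypothetical sketch: build the remote object name for a partial segment,
// e.g. "000000010000000000000002_2_0000000002534868_0000000002534410_sk1.partial".
fn partial_segment_name(
    segment: &str, // like "000000010000000000000001"
    term: u64,
    flush_lsn: u64,
    commit_lsn: u64,
    safekeeper_id: u64,
) -> String {
    format!("{segment}_{term}_{flush_lsn:016X}_{commit_lsn:016X}_sk{safekeeper_id}.partial")
}
```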
Each safekeeper keeps info about remote partial segments in its control
file. The code updates the state in the control file before doing any
S3 operations. This way the control file stores information about all
potentially existing remote partial segments and can clean them up
after uploading a newer version.
Closes #6336
Nightly has added a bunch of compiler and linter warnings. There are
also two dependencies that fail compilation on the latest nightly due
to using the old `stdsimd` feature name. This PR fixes them.
Since fdatasync is used for flushing WAL, changing the file size is
unsafe. Make segment creation atomic by using a tmp file + rename to
avoid using partially initialized segments.
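A minimal sketch of the pattern, assuming fixed-size pre-zeroed segments; the names are illustrative:

```rust
use std::{fs, io, io::Write, path::Path};

// Hypothetical sketch: write a fully zeroed segment to a temp file, fsync it,
// then rename it into place. The segment never changes size afterwards, so
// fdatasync on later WAL flushes suffices (no file-size metadata to persist).
fn create_segment(dir: &Path, name: &str, segment_size: usize) -> io::Result<()> {
    let tmp_path = dir.join(format!("{name}.tmp"));
    let final_path = dir.join(name);
    let mut f = fs::File::create(&tmp_path)?;
    f.write_all(&vec![0u8; segment_size])?;
    f.sync_all()?; // persist the data and the final file size
    fs::rename(&tmp_path, &final_path)?;
    fs::File::open(dir)?.sync_all()?; // persist the rename in the directory
    Ok(())
}
```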
fixes https://github.com/neondatabase/neon/issues/6402
safekeeper.rs is mostly about consensus, but the state is wider. Also
form SafekeeperState, which encapsulates the persistent part + an
in-memory layer with an API for atomic updates.
Moves remote_consistent_lsn back to SafekeeperMemState; fixes its
absence from the memory dump.
Also renames SafekeeperState to TimelinePersistentState, as
TimelineMemState and TimelinePersistentState are created.
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.
The `/copy` API works by copying objects inside S3 for all but the last
segments; the last segments are copied on-disk. A special temporary
directory is created for the timeline, because the copy can take a lot
of time, especially for large timelines. After all segment files have
been prepared, this directory is mounted into the main tree and the
timeline is loaded into memory.
Some caveats:
- large timelines can take a lot of time to copy, because we need to
copy many S3 segments
- the caller should wait for the HTTP call to finish indefinitely and
not close the HTTP connection, because closing it will stop the
process, which is not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- the API will return 200 if the specified `timeline_id` already
exists, even if it's not a copy
- each safekeeper will try to copy S3 segments, so it's better not to
call this API in parallel on different safekeepers
Fixes #4689 by replacing all of `std::Path`, `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.
## Summary of changes
* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.
Fixes #2557
## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
The general semantics of the Rust Write trait (as well as its async
counterpart) are that a write is only guaranteed to have happened after
Write::flush(). This wasn't needed in the sync version, where the rust
write calls the syscall directly, but it is required in async.
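A generic tokio illustration of the async contract (not the PR's code):

```rust
use tokio::io::{AsyncWriteExt, BufWriter};

// Generic illustration: with a buffered async writer, the bytes may sit in
// the buffer until flush() is awaited; only after flush returns is the write
// guaranteed to have been handed to the OS.
async fn send_all(sock: tokio::net::TcpStream, payload: &[u8]) -> std::io::Result<()> {
    let mut w = BufWriter::new(sock);
    w.write_all(payload).await?;
    w.flush().await?; // without this, data may never leave the buffer
    Ok(())
}
```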
Also fix setting the initial end_pos in the walsender; sometimes it was
from the future.
fixes https://github.com/neondatabase/neon/issues/4518
This is a full switch: fs io operations are also tokio ones, working
through a thread pool. Similar to the pageserver, we have multiple
runtimes for easier `top` usage and isolation.
Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be
very careful not to drop a task at a random point, leaving the timeline
in an unclear state. Currently the only writer is the walreceiver and we
don't have top-level cancellation there, so we are good. But to be safe
we should probably add a fuse panicking if a task is dropped while an
operation on a timeline is in progress.
- The timeline lock is a Tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
synchronous, it spawns a thread with a current-thread runtime collecting
data.
- Anything involving closures becomes significantly more complicated, as
async fns are already kinda closures + 'async closures are unstable'.
- The main thread now tracks other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
complains about the timeline map lock being held across await points --
which is not bad here as it happens only in the single-threaded runtime
of the main thread. But having it sync doesn't hurt either.
I'm concerned about the performance of thread pool io offloading, async
traits, and the many await points; but we can try and see how it goes.
fixes https://github.com/neondatabase/neon/issues/3036
fixes https://github.com/neondatabase/neon/issues/3966
It should make remote_consistent_lsn commonly up-to-date on projects
that aren't actively writing, which removes the spike of pageserver ->
safekeeper reconnections on storage node restarts.
The 1.66 release speeds up compile times by over 10% according to
tests. Its Clippy also finds plenty of old nits in our code:
* useless conversions, `foo as u8` where `foo: u8` and similar; removed
`as u8` and the like
* useless references and dereferences (that were automatically adjusted
by the compiler); removed various `&` and `*`
* bool -> u8 conversion via `if/else`, changed to `u8::from`
* Map `.iter()` calls where only values were used; changed to
`.values()` instead
Lints that stand out:
* `Eq` is missing in our protoc-generated structs. Silenced; does not
seem crucial for us.
* `fn default` looks like the one from the `Default` trait, so I've
implemented that instead and replaced the `dummy_*` methods in tests
with `::default()` invocations
* Clippy detected that
```
if retry_attempt < u32::MAX {
retry_attempt += 1;
}
```
is a saturating add and proposed to replace it.
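The suggested replacement:
```
retry_attempt = retry_attempt.saturating_add(1);
```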