When walproposer observes a higher term, it now restarts instead of
crashing the whole compute with PANIC; this avoids a compute crash after
a term_bump call. After a successful election we still check
last_log_term of the highest given vote to ensure the basebackup is good,
and PANIC otherwise.
It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21
ref https://github.com/neondatabase/neon/issues/8700
Check that truncation point is not from the future by comparing it with
write_record_lsn, not write_lsn, and explain that xlog switch changes
their normal order.
ref https://github.com/neondatabase/neon/issues/8911
Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by
adding newlines to the first sentence if it is short enough, and making
a short first sentence if there is the need.
wal_storage.rs already checks this, but since this is a quite legit scenario
check it at safekeeper.rs (consensus level) as well.
ref https://github.com/neondatabase/neon/issues/8212
This is a take 2; previous PR #8640 had been reverted because interplay
with another change broke test_last_log_term_switch.
Endpoint implementation sends msg to manager requesting to do the
reset. Manager stops current partial backup upload task if it exists and
performs the reset.
Also slightly tweak eviction condition: all full segments before
flush_lsn must be uploaded (and committed) and there must be only one
segment left on disk (partial). This allows evicting timelines which
did not start on the first segment and didn't fill the whole
segment (previous condition wasn't good because last_removed_segno was
0).
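
The tweaked condition can be sketched roughly as below. This is only an
illustration, not the safekeeper code: the `WalPosition` struct, its
field names and `first_segno_on_disk` are hypothetical, and LSNs are
modelled as plain `u64` byte offsets.

```rust
/// Hypothetical snapshot of the positions the eviction check looks at.
pub struct WalPosition {
    pub wal_seg_size: u64,
    pub flush_lsn: u64,
    pub commit_lsn: u64,
    pub backup_lsn: u64,
    /// Number of the oldest segment file still on local disk (hypothetical).
    pub first_segno_on_disk: u64,
}

/// True if every full segment before flush_lsn is uploaded and committed,
/// and the only segment left on disk is the partial one.
pub fn ready_for_eviction(p: &WalPosition) -> bool {
    // The partial segment is the one containing flush_lsn.
    let partial_segno = p.flush_lsn / p.wal_seg_size;
    let partial_start = partial_segno * p.wal_seg_size;

    let full_segments_safe =
        p.backup_lsn >= partial_start && p.commit_lsn >= partial_start;
    // Compare against the partial segment itself rather than against a
    // "last removed" counter, which stays 0 if nothing was ever removed.
    let only_partial_left = p.first_segno_on_disk == partial_segno;

    full_segments_safe && only_partial_left
}
```

With this shape a timeline that started on segment 5 and never filled it
is still evictable, because nothing depends on segments 0..4 having been
removed.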
ref https://github.com/neondatabase/neon/issues/8759
Removes additional async_trait usages from safekeeper and neon_local.
Also removes now redundant dependencies of the `async_trait` crate.
cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296
Routes and their handlers were in a bit different order in 1) routes
list 2) their implementation 3) python client 4) openapi spec, making
addition of new ones intimidating. Make it the same everywhere, roughly
lexicographically but preserving some of existing logic.
No functional changes.
Previously, we protected from multiple ProposerElected messages from the same
walproposer with the following condition:
msg.term == self.get_last_log_term() && self.flush_lsn() >
msg.start_streaming_at
It is not exhaustive, i.e. we could still proceed to truncating WAL even though
safekeeper inserted something since the divergence point has been
calculated. While it was most likely safe because walproposer can't use
safekeeper position to commit WAL until last_log_term reaches the current
walproposer term, let's be more careful and properly calculate the divergence
point like walproposer does.
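
A minimal sketch of such a divergence point calculation, assuming term
histories are sorted lists of (term, start LSN) entries. The `TermLsn`
type and `find_divergence_point` name are illustrative, and the actual
walproposer logic may differ in details:

```rust
/// One term history entry: `term` started at `lsn`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TermLsn {
    pub term: u64,
    pub lsn: u64,
}

/// Returns the highest point up to which the proposer and safekeeper logs
/// agree, or None if the histories have no common entry.
pub fn find_divergence_point(
    prop: &[TermLsn],
    sk: &[TermLsn],
    sk_flush_lsn: u64,
) -> Option<TermLsn> {
    // Length of the common prefix of the two histories.
    let common = prop
        .iter()
        .zip(sk.iter())
        .take_while(|(a, b)| a == b)
        .count();
    if common == 0 {
        return None;
    }
    let last = common - 1;

    // The last common term ends where the next term starts; on the
    // safekeeper it otherwise ends at the flush position.
    let sk_end = sk.get(last + 1).map_or(sk_flush_lsn, |e| e.lsn);
    // Assumption: if the proposer has no later entry, its log is treated
    // as unbounded and only the safekeeper side limits the result.
    let lsn = match prop.get(last + 1) {
        Some(next) => next.lsn.min(sk_end),
        None => sk_end,
    };
    Some(TermLsn { term: prop[last].term, lsn })
}
```

Anything the safekeeper holds beyond the returned LSN belongs to a
different history and is what truncation would remove.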
## Problem
The control file contains the id of the safekeeper that uploaded it.
Previously, when sending a snapshot of the control file to another sk,
it would eventually be gc-ed by the receiving sk. This is incorrect
because the original sk might still need it later.
## Summary of Changes
When sending a snapshot and the control file contains an uploaded
segment:
* Create a copy of the segment in s3 with the destination sk in the
object name
* Tweak the streamed control file to point to the object created in the
previous step
Note that the snapshot endpoint now has to know the id of the requestor,
so the api has been extended to include the node id of the destination
sk.
Closes https://github.com/neondatabase/neon/issues/8542
## Problem
There is an unused safekeeper option `partial_backup_enabled`.
`partial_backup_enabled` was implemented in #6530, but the option has
been unconditionally enabled since #8022.
If you intended to keep this option for a specific reason, I will close
this PR.
## Summary of changes
I removed an unused safekeeper option `partial_backup_enabled`.
This commit tries to fix regular load spikes on staging, caused by too
many eviction and partial upload operations running at the same time.
It usually happened after a restart; for partial backup the load was
delayed.
- Add a semaphore for evictions (2 permits by default)
- Rename `resident_since` to `evict_not_before` and smooth out the curve
by using random duration
- Use random duration in partial uploads as well
related to https://github.com/neondatabase/neon/issues/6338
some discussion in
https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029
## Problem
The scrubber would like to check the highest mtime in a tenant's objects
as a safety check during purges. It recently switched to use
GenericRemoteStorage, so we need to expose that in the listing methods.
## Summary of changes
- In Listing.keys, return a ListingObject{} including a last_modified
field, instead of a RemotePath
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
PR #8299 has switched the storage scrubber to use
`DefaultCredentialsChain`. Now we do this for `remote_storage`, as it
allows us to use `remote_storage` from inside kubernetes. Most of the
diff is due to `GenericRemoteStorage::from_config` becoming `async fn`.
## Problem
After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.
## Summary of changes
- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
The error means that the manager exited earlier than `ResidenceGuard`,
which is not unexpected with the current deletion implementation. This
commit changes the log level to reduce noise.
We have an issue where some partially uploaded segments can actually be
missing in remote storage. I found this issue while looking at the
logs in staging, and it can be triggered by failed uploads:
1. Code tries to upload `SEG_TERM_LSN_LSN_sk5.partial`, but receives
error from S3
2. The failed attempt is saved to `segments` vec
3. After some time, the code tries to upload
`SEG_TERM_LSN_LSN_sk5.partial` again
4. This time the upload is successful and code calls `gc()` to delete
previous uploads
5. Since new object and old object share the same name, uploaded data
gets deleted from remote storage
This commit fixes the issue by patching `gc()` not to delete objects
with the same name as currently uploaded.
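
The idea of the fix, sketched with plain strings. `objects_to_delete` is
a hypothetical helper for illustration, not the actual `gc()`:

```rust
/// Given every object name recorded in `segments` (including failed
/// attempts) and the name that was just uploaded successfully, returns the
/// names that are safe to delete from remote storage.
pub fn objects_to_delete(segments: &[String], current: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for name in segments {
        // A failed attempt and the later successful upload can share a
        // name; deleting it would remove the data we just uploaded.
        if name.as_str() != current && !out.contains(name) {
            out.push(name.clone());
        }
    }
    out
}
```

In the scenario above, the retried `SEG_TERM_LSN_LSN_sk5.partial` entry is
filtered out, so only genuinely older objects are removed.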
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
new clippy warnings on nightly.
## Summary of changes
broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_definitions`
## Problem
Follow up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evict/restores we are doing.
## Summary of changes
- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also ability to calculate how
many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
## Problem
- The condition for eviction is not time-based: it is possible for a
timeline to be restored in response to a client, that client times out,
and then as soon as the timeline is restored it is immediately evicted
again.
- There is no delay on eviction at startup of the safekeeper, so when it
starts up and sees many idle timelines, it does many evictions which
will likely be immediately restored when someone uses the timeline.
## Summary of changes
- Add `eviction_min_resident` parameter, and use it in
`ready_for_eviction` to avoid evictions if the timeline has been
resident for less than this period.
- This also implicitly delays evictions at startup for
`eviction_min_resident`
- Set this to a very low number for the existing eviction test, which
expects immediate eviction.
The default period is 15 minutes. The general reasoning for that is that
in the worst case where we thrash ~10k timelines on one safekeeper,
downloading 16MB for each one, we should set a period that would not
overwhelm the node's bandwidth.
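
A minimal sketch of the time-based part of the check, assuming each
timeline remembers when it became resident. The function name and the
`resident_since` parameter are illustrative only:

```rust
use std::time::{Duration, Instant};

/// True once the timeline has been resident for at least `min_resident`.
/// `resident_since` is set on restore and, for timelines found on disk,
/// at safekeeper startup, which is what delays evictions after a restart.
pub fn resident_long_enough(
    resident_since: Instant,
    now: Instant,
    min_resident: Duration,
) -> bool {
    // saturating_duration_since returns zero if `now` is earlier than
    // `resident_since`, so a skewed reading never allows an early eviction.
    now.saturating_duration_since(resident_since) >= min_resident
}
```

The other eviction conditions still apply; this check only adds a lower
bound on residence time on top of them.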
## Problem
Safekeepers left running for a long time use a lot of memory (up to the
point of OOMing, on small nodes) for deleted timelines, because the
`Timeline` struct is kept alive as a guard against recreating deleted
timelines.
Closes: https://github.com/neondatabase/neon/issues/6810
## Summary of changes
- Create separate tombstones that just record a ttid and when the
timeline was deleted.
- Add a periodic housekeeping task that cleans up tombstones older than
a hardcoded TTL (24h)
I think this also makes https://github.com/neondatabase/neon/pull/6766
un-needed, as the tombstone is also checked during deletion.
I considered making the overall timeline map use an enum type containing
active or deleted, but having a separate map of tombstones avoids
bloating that map, so that calls like `get()` can still go straight to a
timeline without having to walk a hashmap that also contains tombstones.
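
A rough sketch of the separate tombstone map and its housekeeping. The
`Ttid` alias and the `Tombstones` type are stand-ins for illustration,
not the actual safekeeper types:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a (tenant id, timeline id) pair.
pub type Ttid = (u64, u64);

/// Records only which timelines were deleted and when.
#[derive(Default)]
pub struct Tombstones {
    deleted_at: HashMap<Ttid, Instant>,
}

impl Tombstones {
    pub fn record(&mut self, ttid: Ttid, now: Instant) {
        self.deleted_at.insert(ttid, now);
    }

    /// Checked before creating a timeline, to refuse recreation.
    pub fn is_deleted(&self, ttid: &Ttid) -> bool {
        self.deleted_at.contains_key(ttid)
    }

    /// Periodic housekeeping: drops tombstones older than `ttl` and
    /// returns how many were removed.
    pub fn cleanup(&mut self, now: Instant, ttl: Duration) -> usize {
        let before = self.deleted_at.len();
        self.deleted_at
            .retain(|_, at| now.saturating_duration_since(*at) <= ttl);
        before - self.deleted_at.len()
    }
}
```

Each entry is just an id and a timestamp, so memory use no longer scales
with the size of a full `Timeline` struct.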
Before this PR, `RemoteStorageConfig::from_toml` would support
deserializing an
empty `{}` TOML inline table to a `None`, otherwise try `Some()`.
We can instead let
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
RemoteStorageConfig
(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
After https://github.com/neondatabase/neon/pull/8022 was deployed to
staging, I noticed many cases of timeouts. After inspecting the logs, I
realized that some operations take ~20 seconds and run
while holding the shared state lock. This usually happens right after
redeploy, because compute reconnections put high load on disks. This
commit tries to improve observability around slow operations.
Non-observability changes:
- `TimelineState::finish_change` now skips update if nothing has changed
- `wal_residence_guard()` timeout is set to 30s
Fixes https://github.com/neondatabase/neon/issues/6337
Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:
```
--enable-offload
Enable automatic switching to offloaded state
--delete-offloaded-wal
Delete local WAL files after offloading. When disabled, they will be left on disk
--control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
Pending updates to control file will be automatically saved after this interval [default: 300s]
```
Manager watches state updates and detects when there is no activity on
the timeline and the partial backup upload in remote storage is up to
date. When all conditions are met, the state can be switched to offloaded.
In `timeline.rs` there is `StateSK` enum to support switching between
states. When offloaded, code can access only control file structure and
cannot use `SafeKeeper` to accept new WAL.
`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains guard to notify manager about active tasks requiring
on-disk WAL access. All guards are issued by the manager, all requests
are sent via channel using `ManagerCtl`. When manager receives request
to issue a guard, it unevicts timeline if it's currently evicted.
Fixed a bug in partial WAL backup: it previously used `term` instead of
`last_log_term`.
After this commit is merged, next step is to roll this change out, as in
issue #6338.
Previously in safekeeper code, new segment file was opened without
truncate option. I don't think there is a reason to do it, this commit
replaces it with `File::create` to make it simpler and remove
`clippy::suspicious_open_options` linter warning.
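
For reference, `File::create` opens the file write-only, creates it if
missing and truncates it otherwise, so it matches the explicit
`OpenOptions` chain below. The function names are hypothetical and only
illustrate the equivalence:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Short form: create the segment file, truncating any existing content.
pub fn create_segment_file(path: &Path) -> io::Result<File> {
    File::create(path)
}

/// Explicit form with the same behaviour. Leaving out `.truncate(true)`
/// here is what clippy::suspicious_open_options warns about.
pub fn create_segment_file_explicit(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)
}
```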
This is a preparation for #8022, to make the PR both backwards and
forward compatible.
This commit adds an `eviction_state` field to the control file. It adds
support for reading it, but writes the control file in the old format
where possible, to keep the disk format forward compatible.
Note: in `patch_control_file`, new field gets serialized to json like
this:
- `"eviction_state": "Present"`
- `"eviction_state": {"Offloaded": "0/8F"}`
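
To illustrate the two shapes above, here is a hand-rolled sketch. The
real code presumably relies on a serialization library rather than
manual formatting; `to_json` and `format_lsn` are made up for this
example, and the LSN is modelled as a `u64` printed in the usual
`hi/lo` hex form:

```rust
/// Illustrative mirror of the two states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvictionState {
    Present,
    /// Carries the flush LSN at the time of offloading.
    Offloaded(u64),
}

/// Formats an LSN as upper 32 bits / lower 32 bits, both in hex.
pub fn format_lsn(lsn: u64) -> String {
    format!("{:X}/{:X}", lsn >> 32, lsn & 0xffff_ffff)
}

impl EvictionState {
    /// Renders the JSON value of the field: a bare string for the unit
    /// variant, a single-key object for the variant carrying an LSN.
    pub fn to_json(&self) -> String {
        match self {
            EvictionState::Present => "\"Present\"".to_string(),
            EvictionState::Offloaded(lsn) => {
                format!("{{\"Offloaded\": \"{}\"}}", format_lsn(*lsn))
            }
        }
    }
}
```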
- Make safekeeper read SAFEKEEPER_AUTH_TOKEN env variable with JWT
token to connect to other safekeepers.
- Set it in neon_local when auth is enabled.
- Create simple rust http client supporting it, and use it in pull_timeline
implementation.
- Enable auth in all pull_timeline tests.
- Make sk http_client() generate a safekeeper-wide token by default, which
makes it easier to enable auth in all tests by default.
- Add /snapshot http endpoint streaming a tar archive of timeline contents
up to flush_lsn.
- Add check that term doesn't change, corresponding test passes now.
- Also prepares infra to hold off WAL removal during the basebackup.
- Sprinkle fsyncs to persist the pull_timeline result.
ref https://github.com/neondatabase/neon/issues/6340
We had a random sleep in the beginning of partial backup task, which was
needed for the first partial backup deploy. It helped with gradual
upload of segments without causing network overload. Now partial backup
is deployed everywhere, so we don't need this random sleep anymore.
We also had an issue related to this, in which manager task was not shut
down for a long time. The cause of the issue is this random sleep that
didn't take timeline cancellation into account, meanwhile manager task
waited for partial backup to complete.
Fixes https://github.com/neondatabase/neon/issues/7967
The general partial backup idea is that each safekeeper keeps only one
partial segment in remote storage at a time. Sometimes this is not true,
for example if we uploaded an object to S3 but got an error when trying to
remove the previous upload. In this case we still keep a list of all
potentially uploaded objects in safekeeper state.
This commit prints a warning to logs if there are too many objects in
safekeeper state. This is not expected and we should try to fix this
state; we can do this by running gc.
I haven't seen this being an issue anywhere, but printing a warning is
something that I wanted to do and forgot in initial PR.
This is a preparation for
https://github.com/neondatabase/neon/issues/6337.
The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks and WAL won't be deleted from disk while there is at least one
active FullAccessTimeline.
To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading missing WAL file and waiting until the download finishes.
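
The guard idea in its simplest synchronous form. The real
`full_access_guard()` is async and fallible; `GuardCounter` and
`AccessGuard` below are hypothetical names used only for this sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shared counter of tasks that currently need on-disk WAL.
#[derive(Default, Clone)]
pub struct GuardCounter {
    active: Arc<AtomicUsize>,
}

/// Held by a task for as long as it needs WAL files on disk.
pub struct AccessGuard {
    active: Arc<AtomicUsize>,
}

impl GuardCounter {
    /// Hands out a guard; eviction stays blocked until it is dropped.
    pub fn full_access_guard(&self) -> AccessGuard {
        self.active.fetch_add(1, Ordering::SeqCst);
        AccessGuard { active: Arc::clone(&self.active) }
    }

    /// Eviction is allowed only when no guard is alive.
    pub fn can_evict(&self) -> bool {
        self.active.load(Ordering::SeqCst) == 0
    }
}

impl Drop for AccessGuard {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```

This sketch does not close the race between a `can_evict()` check and a
guard being issued right after it; a real implementation has to
serialize the two.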
This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned instantly after horizon is updated
epoch is a historical and potentially confusing name. It semantically means
lastLogTerm from the raft paper, so let's use it.
This commit changes only internal namings, not public interface (http).
Do pull_timeline while WAL is being removed. To this end
- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.
After fixing the check of the pull file status code, the test fails for
now, which is expected.
I looked at the metrics from
https://github.com/neondatabase/neon/pull/7768 on staging and it seems
that manager does too many iterations. This is probably caused by
background job `remove_wal.rs` which iterates over all timelines and
tries to remove WAL and persist control file. This causes shared state
updates and wakes up the manager. The fix is to skip notifying about the
updates if nothing was updated.
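
The skip-if-unchanged idea, sketched with a version counter standing in
for the notification channel. `Notifying` is a hypothetical type, not the
shared state implementation:

```rust
/// Holds a value and a version that watchers compare to detect changes.
pub struct Notifying<T> {
    value: T,
    version: u64,
}

impl<T: PartialEq> Notifying<T> {
    pub fn new(value: T) -> Self {
        Notifying { value, version: 0 }
    }

    /// Stores `new` and bumps the version only if the value actually
    /// changed. Returns whether watchers would be woken up.
    pub fn update(&mut self, new: T) -> bool {
        if self.value == new {
            return false;
        }
        self.value = new;
        self.version += 1;
        true
    }

    pub fn version(&self) -> u64 {
        self.version
    }
}
```

A background pass that rewrites the same state for every timeline then
leaves the version untouched, so the manager is not woken up for nothing.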