The LSN lease api actually accepts a tenant_shard_id, not a tenant_id.
But we put the Display of the tenant_shard_id into the tenant_id field.
This PR fixes it.
Refs
- fixes https://databricks.atlassian.net/browse/LKB-2930
Clap automatically uses doc comments as help/about texts. Doc comments
are strictly better, since they're also used e.g. for IDE documentation,
and are better formatted.
This patch updates all `neon_local` commands to use doc comments
(courtesy of GPT-o3).
## Problem
I discovered two bugs corresponding to safekeeper migration, which
together might lead to a data loss during the migration. The second bug
is from a hadron patch and might lead to a data loss during the
safekeeper restore in hadron as well.
1. `switch_membership` returns the current `term` instead of
`last_log_term`. It is used to choose the `sync_position` in the
algorithm, so we might choose the wrong one and break the correctness
guarantees.
2. The current `term` is used to choose the most advanced SK in
`pull_timeline` with higher priority than `flush_lsn`. It is incorrect
because the most advanced safekeeper is the one with the highest
`(last_log_term, flush_lsn)` pair. The compute might bump term on the
least advanced sk, making it the best choice to pull from, and thus
making committed log entries "uncommitted" after `pull_timeline`
Part of https://databricks.atlassian.net/browse/LKB-1017
## Summary of changes
- Return `last_log_term` in `switch_membership`
- Use `(last_log_term, flush_lsn)` as a primary key for choosing the
most advanced sk in `pull_timeline` and deny pulling if the `max_term`
is higher than on the most advanced sk (hadron only)
- Write tests for both cases
- Retry `sync_safekeepers` in `compute_ctl`
- Take into the account the quorum size when calculating `sync_position`
## Problem
The deletion logic had become difficult to understand and maintain.
## Summary of changes
- Added an RFC detailing proposed improvements to all deletion-related
APIs.
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
Add DELETE /lfc/prewarm route which handles ongoing prewarm
cancellation, update API spec, add prewarm Cancelled state
Add offload Cancelled state when LFC is not initialized
These lines are a significant fraction of the total log size of the
regression tests. And it seems very uninteresting, it's always
'localhost' in local tests.
## Problem
In #12268, we added Pageserver gRPC addresses to the storage controller.
However, we didn't actually persist these in the database.
## Summary of changes
Update the database with the new gRPC address on reattach.