John G. Crowley 6a35a3e9f1 HCC, resolved GCS upload permit deadlock, SK generation delete bug-fix. (#12873)
## Problem

**HCC SafeKeepers**
* Currently, the `hcc_base_url` flag is set to `None`, disabling
automatic timeline pull from other SafeKeepers on restart. We can
manually call `pull_timeline` but would prefer to use the Hadron
functionality.

**GCS Semaphore Permit Deadlock on Upload**
* The GCS `upload` trait implementation acquires a semaphore permit and
then calls `put_object`, which tries to acquire a permit again, creating
a deadlock. Each `upload` acquires, calls `put_object`, finds nothing
left to acquire, times out, retries, and so on.
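
A minimal sketch of the nested-acquisition pattern described above, using only the standard library. The `Semaphore` type, the function names, the 50 ms timeout, and the concurrency limit of 1 are assumptions for illustration; this is not the actual GCS client code, which is async.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical counting semaphore standing in for the client's concurrency limiter.
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// A held permit; it is returned to the semaphore when dropped.
pub struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Wait up to `timeout` for a free permit; `None` means the wait timed out.
    pub fn acquire_timeout(&self, timeout: Duration) -> Option<Permit<'_>> {
        let guard = self.permits.lock().unwrap();
        let (mut guard, _timed_out) = self
            .available
            .wait_timeout_while(guard, timeout, |free| *free == 0)
            .unwrap();
        if *guard == 0 {
            return None;
        }
        *guard -= 1;
        Some(Permit { sem: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}

// Assumed timeout, kept short so the sketch runs quickly.
const TIMEOUT: Duration = Duration::from_millis(50);

/// Buggy shape: `put_object` acquires a second permit while the caller still holds one.
pub fn put_object_nested(sem: &Semaphore) -> Result<(), &'static str> {
    let _permit = sem.acquire_timeout(TIMEOUT).ok_or("put_object: timed out waiting for permit")?;
    Ok(())
}

/// Fixed shape: `put_object` relies on the permit already held by `upload`.
pub fn put_object_plain() -> Result<(), &'static str> {
    Ok(())
}

pub fn upload_buggy(sem: &Semaphore) -> Result<(), &'static str> {
    let _permit = sem.acquire_timeout(TIMEOUT).ok_or("upload: timed out waiting for permit")?;
    put_object_nested(sem) // the only permit is held above, so this times out
}

pub fn upload_fixed(sem: &Semaphore) -> Result<(), &'static str> {
    let _permit = sem.acquire_timeout(TIMEOUT).ok_or("upload: timed out waiting for permit")?;
    put_object_plain()
}
```

With a single permit, `upload_buggy` always returns an error because the inner acquisition can never succeed while the outer permit is held, whereas `upload_fixed` completes. With a larger limit the same failure appears once enough concurrent uploads hold all the outer permits.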

**Storage Controller delete API for SafeKeepers Bug**
* Noticed this while doing a PITR and reusing an old Timeline ID (that
had been previously deleted).
* The `DELETE` timeline endpoint in Storage Controller fails to delete the
timeline (TL) due to a generation number mismatch between the [Pending
Op](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_service.rs#L565)
(gen = `i32::MAX`) and the [Schedule
Request](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_service.rs#L582)
(gen = SK.generation). The existing Pending Op [blocks the
deletion](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_reconciler.rs#L462)
of the database record for the TL, because the record is removed only on
[condition that the request generation == the pending op
generation](https://github.com/neondatabase/neon/blob/main/storage_controller/src/persistence.rs#L1844).
That condition is met only after the Storage Controller is restarted, [where
pending operations are
reloaded](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_reconciler.rs#L162)
and the request generation is set to `i32::MAX`.
* If the same Timeline ID is reused after its `DELETE`, the old
`start_lsn` value is still in the Storage Controller database and
causes Compute's WalProposer to crashloop, because it thinks it starts
from the LSN of the prior timeline with that ID.
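
The generation mismatch above can be modeled with a small in-memory sketch. `PendingOps`, `remove_if_generation`, the timeline ID `"tl-1"`, and the generation value `7` are hypothetical stand-ins; the real check lives in the Storage Controller persistence layer linked above and runs against the database.

```rust
use std::collections::HashMap;

/// Hypothetical pending operation row; only the generation matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingOp {
    pub generation: i32,
}

/// Hypothetical in-memory stand-in for the pending-op table, keyed by timeline ID.
pub struct PendingOps {
    rows: HashMap<String, PendingOp>,
}

impl PendingOps {
    pub fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    /// Record a pending op for a timeline with the given generation.
    pub fn insert(&mut self, timeline_id: &str, generation: i32) {
        self.rows.insert(timeline_id.to_string(), PendingOp { generation });
    }

    /// Conditional delete: the row is removed only when the request generation
    /// equals the stored generation. Returns whether a row was removed.
    pub fn remove_if_generation(&mut self, timeline_id: &str, request_generation: i32) -> bool {
        let matches = self
            .rows
            .get(timeline_id)
            .map_or(false, |op| op.generation == request_generation);
        if matches {
            self.rows.remove(timeline_id);
        }
        matches
    }

    pub fn contains(&self, timeline_id: &str) -> bool {
        self.rows.contains_key(timeline_id)
    }
}
```

In this model, a row stored with `i32::MAX` survives a delete request carrying the SafeKeeper's generation and is removed only by a request that also carries `i32::MAX`, which mirrors the restart case. Storing the row with the SafeKeeper's generation lets the ordinary request remove it.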

## Summary of changes
* Activate `hcc_base_url` (Hadron) argument for SafeKeeper binary to
enable automatic timeline pull from other SafeKeepers on start.
* Remove the nested permit acquisition and timeout wrapper from
`put_object` in GCS client, as `put_object` is only called by the
`upload` trait implementation.
* Set the Pending Op generation number to the SafeKeeper generation to
allow timeline deletion from the Storage Controller database without
having to bounce a Storage Controller pod.
2026-03-25 09:27:03 +01:00