## Problem

**HCC SafeKeepers**

* Currently, the `hcc_base_url` flag is set to `None`, which disables automatic timeline pull from other SafeKeepers on restart. We can call `pull_timeline` manually but would prefer to use the Hadron functionality.

**GCS Semaphore Permit Deadlock on Upload**

* The GCS `upload` trait implementation acquires a semaphore permit and then calls `put_object`, which acquires a permit again. This duplicate acquisition creates a deadlock: each `upload` acquires a permit, calls `put_object`, which finds nothing left to acquire, times out, retries, and so on.

**Storage Controller delete API for SafeKeepers Bug**

* Noticed this while doing a PITR and reusing an old Timeline ID (one that had previously been deleted).
* The `DELETE` timeline endpoint in Storage Controller fails to delete the TL due to a generation number mismatch between the [Pending Op](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_service.rs#L565) (gen = `i32::MAX`) and the [Schedule Request](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_service.rs#L582) (gen = SK.generation). The extant Pending Op [blocks the deletion](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_reconciler.rs#L462) of the database record for the TL, because the Pending Op is only removed on the [condition that the request generation == the pending op generation](https://github.com/neondatabase/neon/blob/main/storage_controller/src/persistence.rs#L1844). That condition only holds after the Storage Controller is restarted, [where pending operations are reloaded](https://github.com/neondatabase/neon/blob/main/storage_controller/src/service/safekeeper_reconciler.rs#L162) and the request generation is set to `i32::MAX`.
* As a result, if the same Timeline ID is reused after its `DELETE`, the old `start_lsn` value remains in the Storage Controller database. Compute's WalProposer then crashloops because it believes it starts from the LSN of the prior timeline with that same ID.
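
The generation mismatch described above can be illustrated with a small standalone model. This is only a sketch, not the Storage Controller code: `PendingOps`, `remove_if_generation_matches`, `try_delete`, the `"tl-1"` key, and the generation value `7` are all hypothetical names and values chosen for illustration. The one thing it takes from the description is that a pending op is cleared only when the request generation equals the pending op generation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the pending-op table: timeline id -> generation.
struct PendingOps {
    rows: HashMap<String, i32>,
}

impl PendingOps {
    /// Removes the pending op only when the generations are equal, and
    /// reports whether a row was removed. This models the equality
    /// condition from the description, not the actual database query.
    fn remove_if_generation_matches(&mut self, timeline_id: &str, request_gen: i32) -> bool {
        let matches = self.rows.get(timeline_id) == Some(&request_gen);
        if matches {
            self.rows.remove(timeline_id);
        }
        matches
    }
}

/// Returns true if the pending op was cleared, i.e. the timeline record
/// would no longer be blocked from deletion.
fn try_delete(pending_gen: i32, request_gen: i32) -> bool {
    let mut ops = PendingOps { rows: HashMap::new() };
    ops.rows.insert("tl-1".to_string(), pending_gen);
    ops.remove_if_generation_matches("tl-1", request_gen)
}

fn main() {
    let sk_generation = 7; // arbitrary example value for SK.generation

    // Pending op stored with i32::MAX, request scheduled with SK.generation:
    // the generations differ, so the pending op stays and blocks deletion.
    println!("mismatched generations: {}", try_delete(i32::MAX, sk_generation));

    // After a restart, reloaded requests also carry i32::MAX, so they match.
    println!("after restart: {}", try_delete(i32::MAX, i32::MAX));

    // Pending op stored with SK.generation: matches without a restart.
    println!("matching SK generation: {}", try_delete(sk_generation, sk_generation));
}
```

In this model, storing the pending op with the SafeKeeper generation makes the equality check succeed on the first scheduled request, with no restart needed.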
## Summary of changes

* Activate the `hcc_base_url` (Hadron) argument for the SafeKeeper binary to enable automatic timeline pull from other SafeKeepers on start.
* Remove the nested permit acquisition and timeout wrapper from `put_object` in the GCS client, as `put_object` is only called by the `upload` trait implementation.
* Set the Pending Op generation number to the SafeKeeper generation to allow timeline deletion from the Storage Controller database without having to bounce a Storage Controller pod.
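
The nested permit acquisition removed from `put_object` can be reproduced in miniature. The sketch below uses only the Rust standard library and a hand-rolled blocking `Semaphore`; the real GCS client is presumably async, so every type and function here (`Semaphore`, `acquire_timeout`, `upload_nested`, `put_object_nested`, the 50 ms `TIMEOUT`) is a hypothetical stand-in. A single permit is the simplest case that shows the problem. With more permits, the same starvation would appear once enough uploads run concurrently.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Minimal counting semaphore (hypothetical, for illustration only).
struct Semaphore {
    permits: Mutex<usize>,
    freed: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Waits up to `timeout` for a permit; returns false on timeout.
    fn acquire_timeout(&self, timeout: Duration) -> bool {
        let guard = self.permits.lock().unwrap();
        let (mut guard, _timed_out) = self
            .freed
            .wait_timeout_while(guard, timeout, |available| *available == 0)
            .unwrap();
        if *guard == 0 {
            return false;
        }
        *guard -= 1;
        true
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.freed.notify_one();
    }
}

const TIMEOUT: Duration = Duration::from_millis(50);

/// Problematic shape: acquires a second permit although the caller holds one.
fn put_object_nested(sem: &Semaphore) -> Result<(), &'static str> {
    if !sem.acquire_timeout(TIMEOUT) {
        return Err("put_object timed out waiting for a permit");
    }
    sem.release();
    Ok(())
}

fn upload_nested(sem: &Semaphore) -> Result<(), &'static str> {
    if !sem.acquire_timeout(TIMEOUT) {
        return Err("upload timed out waiting for a permit");
    }
    let result = put_object_nested(sem); // waits on the permit held above
    sem.release();
    result
}

/// Fixed shape: only `upload` acquires; `put_object` relies on its caller.
fn put_object() -> Result<(), &'static str> {
    Ok(())
}

fn upload(sem: &Semaphore) -> Result<(), &'static str> {
    if !sem.acquire_timeout(TIMEOUT) {
        return Err("upload timed out waiting for a permit");
    }
    let result = put_object();
    sem.release();
    result
}

fn main() {
    let sem = Semaphore::new(1);
    println!("nested acquisition: {:?}", upload_nested(&sem));
    println!("single acquisition: {:?}", upload(&sem));
}
```

Acquiring only in `upload` is safe in this sketch for the reason given in the summary: `put_object` has no other callers, so the permit is always held by the time it runs.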