neon/storage_controller at 6f7e3c18e4536dba40d2deb2f2100a3fe54baffd - neon


mirror of https://github.com/neondatabase/neon.git synced 2025-12-26 15:49:58 +00:00

Files

Vlad Lazar 6f7e3c18e4 storage_controller: make leadership protocol more robust (#11703)

## Problem

We saw the following scenario in staging:
1. Pod A starts up. Becomes leader and steps down the previous pod
cleanly.
2. Pod B starts up (deployment).
3. Step down request from pod B to pod A times out. Pod A did not manage
to stop its reconciliations within 10 seconds and exited with return
code 1 (see `storage_controller/src/service.rs`, lines 8686-8702, at
7ba8519b43).
4. Pod B marks itself as the leader and finishes start-up.
5. k8s restarts pod A.
6. k8s marks pod B as ready.
7. Pod A sends a step down request to pod B. This succeeds, so pod A is
now the leader.
8. k8s kills pod A because it thinks pod B is healthy and pod A is part
of the old replica set.

We end up in a situation where the only pod we have (B) is stepped down
and attempts to forward requests to a leader that doesn't exist. k8s
can't detect that pod B is in a bad state since the /status endpoint
simply returns 200 if the pod is running.

## Summary of changes

This PR includes a number of robustness improvements to the leadership
protocol:
* use a single step down task per controller
* add a new endpoint to be used as k8s liveness probe and check
leadership status there
* handle restarts explicitly (i.e. don't step yourself down)
* increase the step down retry count
* don't kill the process on long step down since k8s will just restart
it
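
A minimal sketch of how a leadership-aware liveness check could behave,
as opposed to a /status endpoint that always returns 200. This is not the
code from the PR: the `LeadershipStatus` variants, the `liveness_status_code`
function and the grace period are all hypothetical names and assumptions
chosen for illustration.

```rust
use std::time::Duration;

/// Hypothetical leadership states of a controller pod (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeadershipStatus {
    /// The pod is the current leader.
    Leader,
    /// The pod is starting up and has not yet become leader.
    Candidate,
    /// The pod was stepped down `since` ago and forwards requests to a leader.
    SteppedDown { since: Duration },
}

/// Returns the HTTP status code a liveness probe handler could respond with.
///
/// Assumption: a stepped-down pod is tolerated for `grace` (long enough for a
/// normal deployment hand-over) and reported as unhealthy afterwards, so that
/// k8s restarts it instead of leaving it forwarding to a missing leader.
fn liveness_status_code(status: LeadershipStatus, grace: Duration) -> u16 {
    match status {
        LeadershipStatus::Leader | LeadershipStatus::Candidate => 200,
        LeadershipStatus::SteppedDown { since } if since <= grace => 200,
        LeadershipStatus::SteppedDown { .. } => 503,
    }
}
```

With a check along these lines, pod B in the scenario above would start
failing its probe once it had been stepped down for longer than the grace
period, rather than staying "healthy" indefinitely.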

2025-04-24 16:59:56 +00:00

client

storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418)

2025-04-04 06:30:48 +00:00

migrations

storage_controller: coordinate imports across shards in the storage controller (#11345)

2025-04-24 11:26:06 +00:00

src

storage_controller: make leadership protocol more robust (#11703)

2025-04-24 16:59:56 +00:00

Cargo.toml

storcon: add https API (#11239)

2025-03-20 08:22:02 +00:00