neon/safekeeper at 832e60c2b4fe700ba703cea3fb0740a37abeb39a - neon


mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00


Arthur Petukhovsky 976576ae59 Fix walreceiver and safekeeper bugs (#2295)

- There was an issue with a zero `commit_lsn`: `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as the physical end of WAL. Now it returns `flush_lsn` (as before) to walproposer and `commit_lsn` to everyone else, including the pageserver.
- There was an issue with backoff where the connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was sleeping before establishing the connection. This is fixed by reworking the retry logic.
- There was an issue with getting the `NoKeepAlives` reason in a loop. The cause is probably the same as for the previous issue.
- There was an issue with filtering safekeepers based on retry attempts, which could filter out some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors that had no context. This is fixed by adding the timeline to safekeeper errors.
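
The IDENTIFY_SYSTEM change can be illustrated with a minimal sketch. This is not the actual safekeeper code: the `Client` enum and the `identify_system_lsn` function are hypothetical names introduced here, and `Lsn` is modelled as a plain `u64` wrapper.

```rust
/// Log sequence number, modelled here as a plain u64 wrapper (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

/// Who issued IDENTIFY_SYSTEM. Hypothetical enum, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Client {
    Walproposer,
    /// Everyone else, e.g. the pageserver.
    Other,
}

/// Pick the LSN reported as the end of WAL for the given client:
/// walproposer still gets `flush_lsn`, every other client gets `commit_lsn`.
fn identify_system_lsn(client: Client, flush_lsn: Lsn, commit_lsn: Lsn) -> Lsn {
    match client {
        Client::Walproposer => flush_lsn,
        Client::Other => commit_lsn,
    }
}
```

With `flush_lsn = Lsn(200)` and `commit_lsn = Lsn(100)`, a walproposer sees `Lsn(200)`, while a pageserver sees `Lsn(100)` and so is never pointed at WAL that has not been committed yet.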

The new retry logic works like this:
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment.
- When a walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` is advanced using WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
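
The three rules above can be sketched roughly as follows. This is a simplified illustration, not the code from this commit: only `next_retry_at` comes from the description, while `RetryInfo`, its methods, the doubling factor, and both constants are assumptions made for the example.

```rust
use std::time::{Duration, Instant};

/// Hypothetical values; the commit message does not state the real ones.
const INITIAL_COOLDOWN: Duration = Duration::from_secs(1);
const MAX_COOLDOWN: Duration = Duration::from_secs(60);

/// Per-safekeeper retry state (illustrative names, except `next_retry_at`).
#[derive(Debug, Default)]
struct RetryInfo {
    /// The candidate is skipped until this moment; `None` means "try right away".
    next_retry_at: Option<Instant>,
    /// Cooldown applied on the most recent disconnect; `None` before the first one.
    cooldown: Option<Duration>,
}

impl RetryInfo {
    /// Rule 1: a safekeeper is a connection candidate only once `next_retry_at` has passed.
    fn is_candidate(&self, now: Instant) -> bool {
        match self.next_retry_at {
            Some(at) => now >= at,
            None => true,
        }
    }

    /// Rule 2: on every disconnect, grow the cooldown exponentially (capped)
    /// and push `next_retry_at` into the future.
    fn on_disconnect(&mut self, now: Instant) {
        let next = match self.cooldown {
            Some(prev) => (prev * 2).min(MAX_COOLDOWN),
            None => INITIAL_COOLDOWN,
        };
        self.cooldown = Some(next);
        self.next_retry_at = Some(now + next);
    }

    /// Rule 3: once WAL from this safekeeper has advanced `last_record_lsn`,
    /// reset the backoff so an immediate reconnect is allowed.
    fn on_progress(&mut self) {
        self.cooldown = None;
        self.next_retry_at = None;
    }
}
```

Keying the filter on a timestamp rather than an attempt counter means a safekeeper always becomes eligible again once its cooldown expires, which is what rules out the indefinite filtering described earlier.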

2022-08-18 13:38:23 +03:00

spec

Fix typos

2022-05-28 14:02:05 +03:00

src

Fix walreceiver and safekeeper bugs (#2295)

2022-08-18 13:38:23 +03:00

Cargo.toml

refactor: replace lazy-static with once-cell (#2195)

2022-08-05 19:34:04 +02:00