From 9d4e3ac27f2fc009c6ce469e49c58c950a4e79ed Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 3 May 2023 17:01:05 +0300
Subject: [PATCH] Fix LSN in keepalive messages, if no WAL has been sent yet

When a new connection is established to the safekeeper, the 'end_pos'
field is initially set to Lsn::INVALID (i.e. 0/0). If there is no WAL
to send to the client, we send KeepAlive messages with Lsn::INVALID.
That confuses the pageserver: it thinks that the safekeeper is lagging
far behind the tip of the branch, and will reconnect to a different
safekeeper. Then the same thing happens with the new safekeeper, until
some WAL is streamed, which sets 'end_pos' to a valid value.

To fix, use 'start_pos' rather than 'end_pos' in the keepalive
messages. When the safekeeper has sent all the WAL it has available,
they are equal. When the safekeeper has some WAL to send, it will send
an XLogData message rather than KeepAlive. If it did send a KeepAlive
even when there was some WAL to send too, I think 'start_pos' would be
the more correct value anyway.

Fixes https://github.com/neondatabase/neon/issues/3972
---
 safekeeper/src/send_wal.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/safekeeper/src/send_wal.rs b/safekeeper/src/send_wal.rs
index 6b303eb0fe..10fe0202cd 100644
--- a/safekeeper/src/send_wal.rs
+++ b/safekeeper/src/send_wal.rs
@@ -551,7 +551,7 @@ impl WalSender<'_, IO> {
 
                 self.pgb
                     .write_message(&BeMessage::KeepAlive(WalSndKeepAlive {
-                        sent_ptr: self.end_pos.0,
+                        sent_ptr: self.start_pos.0,
                         timestamp: get_current_timestamp(),
                         request_reply: true,
                     }))