Fix LSN in keepalive messages when no WAL has been sent yet

When a new connection is established to the safekeeper, the 'end_pos'
field is initially set to Lsn::INVALID (i.e. 0/0). If there is no WAL
to send to the client, we send KeepAlive messages with
Lsn::INVALID. That confuses the pageserver: it concludes that the
safekeeper is lagging far behind the tip of the branch, and reconnects
to a different safekeeper. Then the same thing happens with the new
safekeeper, and so on, until some WAL is streamed, which sets 'end_pos'
to a valid value.
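
To illustrate why a keepalive carrying 0/0 looks like extreme lag, here
is a minimal sketch. It is not the pageserver's actual code: the 'Lsn'
type is a simplified stand-in, and 'lag_bytes', 'should_reconnect' and
the 'max_lag' threshold are hypothetical names invented for this example.

```rust
/// Simplified stand-in for a WAL position (a byte offset into the WAL).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl Lsn {
    /// 0/0, the value 'end_pos' holds on a fresh connection.
    const INVALID: Lsn = Lsn(0);
}

/// Hypothetical helper: how many bytes the reported position trails the tip.
fn lag_bytes(branch_tip: Lsn, reported: Lsn) -> u64 {
    branch_tip.0.saturating_sub(reported.0)
}

/// Hypothetical policy: look for another safekeeper once the lag
/// exceeds 'max_lag' bytes.
fn should_reconnect(branch_tip: Lsn, reported: Lsn, max_lag: u64) -> bool {
    lag_bytes(branch_tip, reported) > max_lag
}
```

Under this assumed policy, any branch whose tip is more than 'max_lag'
bytes past 0/0 triggers a reconnect as soon as a keepalive reports
Lsn::INVALID, even though the safekeeper is fully caught up.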

To fix, use 'start_pos' rather than 'end_pos' in the keepalive
messages. When the safekeeper has sent all the WAL it has available,
the two are equal. When the safekeeper has some WAL to send, it sends
an XLogData message rather than a KeepAlive. Even if it did send a
KeepAlive while there was still WAL to send, I think 'start_pos' would
be the more correct value anyway.
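
The reasoning above can be sketched with a toy model of the sender. This
is an illustration under stated assumptions, not the real WalSender: the
'Sender' struct, the 'Outgoing' enum and 'next_message' are hypothetical,
and only the field names 'start_pos' and 'end_pos' come from the code.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl Lsn {
    const INVALID: Lsn = Lsn(0);
}

/// Hypothetical message type for this sketch.
#[derive(Debug, PartialEq)]
enum Outgoing {
    XLogData { from: Lsn, to: Lsn },
    KeepAlive { sent_ptr: Lsn },
}

/// Toy sender: 'start_pos' is where the next WAL chunk begins;
/// 'end_pos' stays INVALID until the first chunk has been sent.
struct Sender {
    start_pos: Lsn,
    end_pos: Lsn,
}

impl Sender {
    fn new(start_pos: Lsn) -> Self {
        Sender { start_pos, end_pos: Lsn::INVALID }
    }

    /// 'flushed' is how far WAL is available on this safekeeper (assumed input).
    fn next_message(&mut self, flushed: Lsn) -> Outgoing {
        if flushed > self.start_pos {
            // There is WAL to send: stream it and advance both positions.
            self.end_pos = flushed;
            let msg = Outgoing::XLogData { from: self.start_pos, to: flushed };
            self.start_pos = flushed;
            msg
        } else {
            // Nothing to send: report 'start_pos', which is valid even
            // before any WAL has been streamed on this connection.
            Outgoing::KeepAlive { sent_ptr: self.start_pos }
        }
    }
}
```

In this model a fresh connection with nothing to send reports the
requested start position instead of 0/0, and once a chunk has been sent
'start_pos' and 'end_pos' coincide, so later keepalives are unchanged.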

Fixes https://github.com/neondatabase/neon/issues/3972
Heikki Linnakangas
2023-05-03 17:01:05 +03:00
parent 39ca7c7c09
commit 9d4e3ac27f

@@ -551,7 +551,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> WalSender<'_, IO> {
         self.pgb
             .write_message(&BeMessage::KeepAlive(WalSndKeepAlive {
-                sent_ptr: self.end_pos.0,
+                sent_ptr: self.start_pos.0,
                 timestamp: get_current_timestamp(),
                 request_reply: true,
             }))