Compare commits

1 Commits

Author SHA1 Message Date
Heikki Linnakangas
62b1e07b0f Consume fewer XIDs when restarting primary
The pageserver tracks the latest XID seen in the WAL, in the nextXid
field in the "checkpoint" key-value pair. To reduce the churn on that
single storage key, it's not tracked exactly. Rather, when we advance
it, we always advance it to the next multiple of 1024 XIDs. That way,
we only need to insert a new checkpoint value to the storage every
1024 transactions.
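
The rounding described above can be sketched as follows. This is a minimal illustration, not the pageserver's code: `align_up` and `count_writes` are hypothetical names, the interval is assumed to be a power of two, and XID wraparound is ignored for simplicity.

```rust
/// Hypothetical helper: round `xid` up to the next multiple of `interval`.
/// Assumes `interval` is a power of two, so the mask trick is valid.
fn align_up(xid: u32, interval: u32) -> u32 {
    debug_assert!(interval.is_power_of_two());
    xid.wrapping_add(interval - 1) & !(interval - 1)
}

/// Count how many times the stored nextXid value would change while
/// observing every XID in `first..=last`. Wraparound is ignored here.
fn count_writes(first: u32, last: u32, interval: u32) -> u32 {
    let mut stored = 0u32;
    let mut writes = 0u32;
    for xid in first..=last {
        // nextXid must be greater than any XID seen, hence `xid + 1`.
        let next = align_up(xid + 1, interval);
        if next > stored {
            stored = next;
            writes += 1;
        }
    }
    writes
}

fn main() {
    println!("align_up(101, 1024) = {}", align_up(101, 1024));
    println!("writes for XIDs 3..=5000: {}", count_writes(3, 5000, 1024));
}
```

In this model, a run of a few thousand consecutive XIDs only updates the stored value a handful of times, which is the churn reduction the rounding is meant to buy.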

However, read-only replicas now scan the WAL at startup, to find any
XIDs that haven't been explicitly aborted or committed, and treat
them as still in-progress (PR #7288). When we bump up the nextXid
counter by 1024, all those skipped XIDs look like in-progress XIDs to a
read replica. There's a limited amount of space for tracking
in-progress XIDs, so there's more cost to skipping XIDs now. We had a
case in production where a read replica did not start up, because the
primary had gone through many restart cycles without writing any
running-xacts or checkpoint WAL records, and each restart added almost
1024 "orphaned" XIDs that had to be tracked as in-progress in the
replica. As soon as the primary writes a running-xacts or checkpoint
record, the orphaned XIDs can be removed from the in-progress XIDs
list and the problem resolves, but if those records are not written,
the orphaned XIDs just accumulate.

We should work harder to make sure that a running-xacts or checkpoint
record is written at primary startup or shutdown. But at the same
time, we can just make XID_CHECKPOINT_INTERVAL smaller, to consume
fewer XIDs in such scenarios. That means that we will generate more
versions of the checkpoint key-value pair in the storage, but we
haven't seen any problems with that so it's probably fine to go from
1024 to 128.
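
To get a rough feel for the effect of the smaller interval, here is a simplified model of the restart scenario. It is only a sketch under stated assumptions: `orphaned_xids` is a hypothetical function, each run of the primary is assumed to consume a fixed number of XIDs that all commit or abort, and no running-xacts or checkpoint record is ever written.

```rust
/// Round `xid` up to the next multiple of `interval` (a power of two).
fn align_up(xid: u64, interval: u64) -> u64 {
    (xid + interval - 1) & !(interval - 1)
}

/// Simplified model (an assumption, not the real system): the primary
/// starts each run at the stored nextXid, consumes `xids_per_run` XIDs,
/// and the stored nextXid is then rounded up to the next boundary.
/// Every XID between the last one used and that boundary is skipped and
/// would look in-progress ("orphaned") to a read replica.
fn orphaned_xids(restarts: u32, xids_per_run: u64, interval: u64) -> u64 {
    let mut next_xid = interval; // start on a boundary
    let mut orphaned = 0;
    for _ in 0..restarts {
        let first_unused = next_xid + xids_per_run;
        let aligned = align_up(first_unused, interval);
        orphaned += aligned - first_unused;
        next_xid = aligned;
    }
    orphaned
}

fn main() {
    for interval in [1024u64, 128] {
        println!(
            "interval {interval}: {} orphaned XIDs after 100 restarts",
            orphaned_xids(100, 1, interval)
        );
    }
}
```

With one XID consumed per run, each restart skips 1023 XIDs at an interval of 1024 but only 127 at an interval of 128, so the orphaned set grows about eight times more slowly in this model.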
2024-07-05 19:33:42 +03:00
2 changed files with 23 additions and 10 deletions

View File

@@ -48,6 +48,15 @@ pub const XLOG_SIZE_OF_XLOG_RECORD: usize = std::mem::size_of::<XLogRecord>();
 #[allow(clippy::identity_op)]
 pub const SIZE_OF_XLOG_RECORD_DATA_HEADER_SHORT: usize = 1 * 2;
+/// Interval of checkpointing metadata file. We should store metadata file to enforce
+/// predicate that checkpoint.nextXid is larger than any XID in WAL.
+/// But flushing checkpoint file for each transaction seems to be too expensive,
+/// so XID_CHECKPOINT_INTERVAL is used to forward align nextXid and so perform
+/// metadata checkpoint only once per XID_CHECKPOINT_INTERVAL transactions.
+/// XID_CHECKPOINT_INTERVAL should not be larger than BLCKSZ*CLOG_XACTS_PER_BYTE
+/// in order to let CLOG_TRUNCATE mechanism correctly extend CLOG.
+const XID_CHECKPOINT_INTERVAL: u32 = 128;
 pub fn XLogSegmentsPerXLogId(wal_segsz_bytes: usize) -> XLogSegNo {
     (0x100000000u64 / wal_segsz_bytes as u64) as XLogSegNo
 }
@@ -322,10 +331,14 @@ impl CheckPoint {
     /// Returns 'true' if the XID was updated.
     pub fn update_next_xid(&mut self, xid: u32) -> bool {
         // nextXid should be greater than any XID in WAL, so increment provided XID and check for wraparound.
-        let new_xid = std::cmp::max(
+        let mut new_xid = std::cmp::max(
             xid.wrapping_add(1),
             pg_constants::FIRST_NORMAL_TRANSACTION_ID,
         );
+        // To reduce number of metadata checkpoints, we forward align XID on XID_CHECKPOINT_INTERVAL.
+        // XID_CHECKPOINT_INTERVAL should not be larger than BLCKSZ*CLOG_XACTS_PER_BYTE
+        new_xid =
+            new_xid.wrapping_add(XID_CHECKPOINT_INTERVAL - 1) & !(XID_CHECKPOINT_INTERVAL - 1);
         let full_xid = self.nextXid.value;
         let old_xid = full_xid as u32;
         if new_xid.wrapping_sub(old_xid) as i32 > 0 {
@@ -347,7 +360,7 @@ impl CheckPoint {
     /// Advance next multi-XID/offset to those given in arguments.
     ///
     /// It's important that this handles wraparound correctly. This should match the
-    /// MultiXactAdvceNextMXact() logic in PostgreSQL's xlog_redo() function.
+    /// MultiXactAdvanceNextMXact() logic in PostgreSQL's xlog_redo() function.
     ///
     /// Returns 'true' if the Checkpoint was updated.
     pub fn update_next_multixid(&mut self, multi_xid: u32, multi_offset: u32) -> bool {

View File

@@ -187,19 +187,19 @@ pub fn test_update_next_xid() {
     // The input XID gets rounded up to the next XID_CHECKPOINT_INTERVAL
     // boundary
     checkpoint.update_next_xid(100);
-    assert_eq!(checkpoint.nextXid.value, 1024);
+    assert_eq!(checkpoint.nextXid.value, 128);
     // No change
-    checkpoint.update_next_xid(500);
-    assert_eq!(checkpoint.nextXid.value, 1024);
-    checkpoint.update_next_xid(1023);
-    assert_eq!(checkpoint.nextXid.value, 1024);
+    checkpoint.update_next_xid(100);
+    assert_eq!(checkpoint.nextXid.value, 128);
+    checkpoint.update_next_xid(127);
+    assert_eq!(checkpoint.nextXid.value, 128);
     // The function returns the *next* XID, given the highest XID seen so
-    // far. So when we pass 1024, the nextXid gets bumped up to the next
+    // far. So when we pass 128, the nextXid gets bumped up to the next
     // XID_CHECKPOINT_INTERVAL boundary.
-    checkpoint.update_next_xid(1024);
-    assert_eq!(checkpoint.nextXid.value, 2048);
+    checkpoint.update_next_xid(128);
+    assert_eq!(checkpoint.nextXid.value, 256);
 }
 #[test]