mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-09 22:42:57 +00:00
## Problem Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1720511577862519 We're seeing OOMs in staging on a pageserver that has l0_flush.mode=Direct enabled. There's a strong correlation between jumps in `maxrss_kb` and `pageserver_timeline_ephemeral_bytes`, so, it's quite likely that l0_flush.mode=Direct is the culprit. Notably, the expected max memory usage on that staging server by the l0_flush.mode=Direct is ~2GiB but we're seeing as much as 24GiB max RSS before the OOM kill. One hypothesis is that we're dropping the semaphore permit before all the dirtied pages have been flushed to disk. (The flushing to disk likely happens in the fsync inside the `.finish()` call, because we're using ext4 in data=ordered mode). ## Summary of changes Hold the permit until after we're done with `.finish()`.