mirror of
https://github.com/neondatabase/neon.git
synced 2026-06-01 12:30:38 +00:00
The Problem
-----------
Before this patch, the following sequence could happen:
* read_dir().next() returns the uninit mark entry, so we delete both the timeline dir and the mark
* read_dir().next() then returns the entry for the timeline dir we just deleted
* this is totally normal: a directory iterator is not invalidated by directory modification
* we see that there's no uninit mark
* so we try to load the timeline
* but actually, the timeline dir is gone, so we fail the load with an error like
```
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata
Caused by:
0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
1: No such file or directory (os error 2)
```
The Fix
-------
Turn the purging of temp entries into a fix-point iteration that restarts the directory scan after every purge.
This is expressive, but less efficient. I'm ok with the inefficiency until it becomes a problem.
After the fix-point iteration, do a read-only iteration in which we expect to find only
dir entries that are valid timeline dirs.
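
A minimal sketch of the fix-point idea, not the actual pageserver code. The
suffix constant and all function names below are hypothetical, and the sketch
assumes a mark file is named after its timeline dir plus that suffix. Every
purge abandons the current read_dir() iterator and starts a fresh scan, so no
stale entry from an earlier scan is ever acted on.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical suffix of uninit mark files (an assumption for this sketch).
const UNINIT_MARK_SUFFIX: &str = "___uninit";

/// Scans the directory once and purges at most one uninit mark together
/// with the timeline dir it guards. Returns Ok(true) if something was
/// purged, which tells the caller to restart the scan from scratch.
fn purge_one(timelines_dir: &Path) -> io::Result<bool> {
    for entry in fs::read_dir(timelines_dir)? {
        let path = entry?.path();
        let is_mark =
            path.extension().and_then(|e| e.to_str()) == Some(UNINIT_MARK_SUFFIX);
        if !is_mark {
            continue;
        }
        // Assumed layout: "<name>.___uninit" guards the directory "<name>".
        let timeline_dir = path.with_extension("");
        if timeline_dir.exists() {
            fs::remove_dir_all(&timeline_dir)?;
        }
        fs::remove_file(&path)?;
        // Do not keep using this iterator after modifying the directory.
        return Ok(true);
    }
    Ok(false)
}

/// Fix-point iteration: rescan until a full pass purges nothing.
/// Returns the number of purged marks.
fn purge_until_fixpoint(timelines_dir: &Path) -> io::Result<usize> {
    let mut purged = 0;
    while purge_one(timelines_dir)? {
        purged += 1;
    }
    Ok(purged)
}

/// Read-only pass that runs after the fix-point iteration: list what is left.
fn remaining_entries(timelines_dir: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(timelines_dir)? {
        names.push(entry?.file_name().to_string_lossy().into_owned());
    }
    names.sort();
    Ok(names)
}
```

Each purge costs one extra directory scan, which is the inefficiency accepted
above.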
I also took the liberty to drive-by fix two issues that have been bugging me
for some time:
1. Report a precise error when extracting the TimelineId from an uninit mark file
refs https://github.com/neondatabase/neon/issues/3488
2. Bail out instead of WARN-logging if there is a directory entry in the
timelines dir that is not a valid TimelineId.
Bailing out means that the tenant will fail to load, i.e., it will be `Broken`.
An operator can fix the situation using ignore+fix+load.
Before this patch, we would just log a warning and continue.
In my opinion, being strict about this is the better choice, because,
if we somehow miss a timeline's existence, we make incorrect GC decisions.
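
The strict behavior from item 2 can be sketched as follows. This is an
illustration only: the type and function names are hypothetical, and it
assumes a TimelineId is written as 32 hex characters (16 bytes). A single
invalid name fails the whole listing instead of being skipped with a warning,
and the error names the offending entry.

```rust
/// Stand-in for the real TimelineId type (assumed here to be 16 bytes).
#[derive(Debug, PartialEq)]
struct TimelineId([u8; 16]);

/// Parses a directory entry name strictly; the error says what was wrong.
fn parse_timeline_id(name: &str) -> Result<TimelineId, String> {
    if name.len() != 32 || !name.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 32 hex characters, got {name:?}"));
    }
    let mut bytes = [0u8; 16];
    for (i, byte) in bytes.iter_mut().enumerate() {
        // Slicing is safe: the name was checked to be 32 ASCII characters.
        *byte = u8::from_str_radix(&name[2 * i..2 * i + 2], 16)
            .map_err(|e| format!("invalid hex in {name:?}: {e}"))?;
    }
    Ok(TimelineId(bytes))
}

/// Bails out on the first invalid entry rather than logging and continuing.
fn parse_all_timeline_ids(names: &[&str]) -> Result<Vec<TimelineId>, String> {
    names
        .iter()
        .map(|name| {
            parse_timeline_id(name)
                .map_err(|e| format!("entry {name:?} is not a valid timeline dir: {e}"))
        })
        .collect()
}
```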