Commit
```
commit 472cc17b7a
Author: Dmitry Rodionov <dmitry@neon.tech>
Date: Thu Jun 15 17:30:12 2023 +0300
propagate lock guard to background deletion task (#4495)
```
did a drive-by fix, but, the drive-by had a typo.
```
gc_loop{tenant_id=2e2f2bff091b258ac22a4c4dd39bd25d}:update_gc_info{timline_id=837c688fd37c903639b9aa0a6dd3f1f1}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000024DA0D1-000000000443FB51}:panic{thread=background op worker location=pageserver/src/tenant/timeline.rs:4843:25}: missing extractors: ["TimelineId"]
Stack backtrace:
0: utils::logging::tracing_panic_hook
at /libs/utils/src/logging.rs:166:21
1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/alloc/src/boxed.rs:2002:9
2: std::panicking::rust_panic_with_hook
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:692:13
3: std::panicking::begin_panic_handler::{{closure}}
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:579:13
4: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/sys_common/backtrace.rs:137:18
5: rust_begin_unwind
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
6: core::panicking::panic_fmt
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
7: pageserver::tenant::timeline::debug_assert_current_span_has_tenant_and_timeline_id
at /pageserver/src/tenant/timeline.rs:4843:25
8: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:4368:9
9: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
10: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}
at /pageserver/src/tenant/timeline.rs:4363:5
11: <pageserver::tenant::timeline::Timeline>::get_reconstruct_data::{closure#0}
at /pageserver/src/tenant/timeline.rs:2618:69
12: <pageserver::tenant::timeline::Timeline>::get::{closure#0}
at /pageserver/src/tenant/timeline.rs:565:13
13: <pageserver::tenant::timeline::Timeline>::list_slru_segments::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:427:42
14: <pageserver::tenant::timeline::Timeline>::is_latest_commit_timestamp_ge_than::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:390:13
15: <pageserver::tenant::timeline::Timeline>::find_lsn_for_timestamp::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:338:17
16: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:3967:71
17: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
18: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}
at /pageserver/src/tenant/timeline.rs:3948:5
19: <pageserver::tenant::Tenant>::refresh_gc_info_internal::{closure#0}
at /pageserver/src/tenant.rs:2687:21
20: <pageserver::tenant::Tenant>::gc_iteration_internal::{closure#0}
at /pageserver/src/tenant.rs:2551:13
21: <pageserver::tenant::Tenant>::gc_iteration::{closure#0}
at /pageserver/src/tenant.rs:1490:13
22: pageserver::tenant::tasks::gc_loop::{closure#0}::{closure#0}
at /pageserver/src/tenant/tasks.rs:187:21
23: pageserver::tenant::tasks::gc_loop::{closure#0}
at /pageserver/src/tenant/tasks.rs:208:5
```
## Problem
## Summary of changes
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
During timeline creation we create special mark file which presense
indicates that initialization didnt complete successfully. In case of a
crash restart we can remove such half-initialized timeline and following
retry from control plane side should perform another attempt.
So in case of a possible crash restart during initial loading we have
following picture:
```
timelines
| - <timeline_id>___uninit
| - <timeline_id>
| - | <timeline files>
```
We call `std::fs::read_dir` to walk files in `timelines` directory one
by one. If we see uninit file
we proceed with deletion of both, timeline directory and uninit file. If
we see timeline we check if uninit file exists and do the same cleanup.
But in fact its possible to get both branches to be true at the same
time. Result of readdir doesnt reflect following directory state
modifications. So you can still get "valid" entry on the next iteration
of the loop despite the fact that it was deleted in one of the previous
iterations of the loop.
To see that you can apply the following patch (it disables uninit mark
cleanup on successful timeline creation):
```diff
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 4beb2664..b3cdad8f 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> {
)
})?;
}
- uninit_mark.remove_uninit_mark().with_context(|| {
- format!(
- "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
- )
- })?;
v.insert(Arc::clone(&new_timeline));
new_timeline.maybe_spawn_flush_loop();
```
And perform the following steps:
```bash
neon_local init
neon_local start
neon_local tenant create
neon_local stop
neon_local start
```
The error is:
```log
INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata
Caused by:
0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
1: No such file or directory (os error 2)
```
So uninit mark got deleted together with timeline directory but we still
got directory entry for it and tried to load it.
The bug prevented tenant from being successfully loaded.
## Summary of changes
Ideally I think we shouldnt place uninit marks in the same directory as timeline directories but move them to separate directory and
gather them as an input to actual listing, but that would be sort of an
on-disk format change, so just check whether entries are still valid
before operating on them.