Allure does not support ansi colored logs, yet `compute_ctl` has them.
Upgrade criterion to get rid of atty dependency, disable ansi colors,
remove atty dependency and disable ansi feature of tracing-subscriber.
This is a heavy-handed approach. I am not aware of a workflow where
you'd want to connect a terminal directly to for example `compute_ctl`,
usually you find the logs in a file. If someone had been using colors,
they will now need to:
- turn the `tracing-subscriber.default-features` to `true`
- edit their wanted project to have colors
I decided to explicitly disable ansi colors in case we would have in
future a dependency accidentally enabling the feature on
`tracing-subscriber`, which would be quite surprising but not
unimagineable.
By getting rid of `atty` from dependencies we get rid of
<https://github.com/advisories/GHSA-g98v-hv3f-hcfr>.
## Problem
All tests have already been parametrised by Postgres version and build
type (to have them distinguishable in the Allure report), but despite
it, it's anyway required to have DEFAULT_PG_VERSION and BUILD_TYPE env
vars set to corresponding values, for example to
run`test_timeline_deletion_with_files_stuck_in_upload_queue[release-pg14-local_fs]`
test it's required to set `DEFAULT_PG_VERSION=14` and
`BUILD_TYPE=release`.
This PR makes the test framework pick up parameters from the test name
itself.
## Summary of changes
- Postgres version and build type related fixtures now are
function-scoped (instead of being sessions scoped before)
- Deprecate `--pg-version` argument in favour of DEFAULT_PG_VERSION env
variable (it's easier to parse)
- GitHub autocomment now includes only one command with all the failed
tests + runs them in parallel
## Problem
#4528
## Summary of changes
Add a 60 seconds default timeout to the reqwest client
Add retries for up to 3 times to call into the metric consumption
endpoint
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Prod logs have deep accidential span nesting. Introduced in #3759 and
has been untouched since, maybe no one watches proxy logs? :) I found it
by accident when looking to see if proxy logs have ansi colors with
`{neon_service="proxy"}`.
The solution is to mostly stop using `Span::enter` or `Span::entered` in
async code. Kept on `Span::entered` in cancel on shutdown related path.
## Problem
Message "set local file cache limit to..." polutes compute logs.
## Summary of changes
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/4392, continuation
of https://github.com/neondatabase/neon/pull/4408
## Summary of changes
This PR removes all layer objects from LayerMap and moves it to the
timeline struct. In timeline struct, LayerFileManager maps a layer
descriptor to a layer object, and it is stored in the same RwLock as
LayerMap to avoid behavior difference.
Key changes:
* LayerMap now does not have generic, and only stores descriptors.
* In Timeline, we add a new struct called layer mapping.
* Currently, layer mapping is stored in the same lock with layer map.
Every time we retrieve data from the layer map, we will need to map the
descriptor to the actual object.
* Replace_historic is moved to layer mapping's replace, and the return
value behavior is different from before. I'm a little bit unsure about
this part and it would be good to have some comments on that.
* Some test cases are rewritten to adapt to the new interface, and we
can decide whether to remove it in the future because it does not make
much sense now.
* LayerDescriptor is moved to `tests` module and should only be intended
for unit testing / benchmarks.
* Because we now have a usage pattern like "take the guard of lock, then
get the reference of two fields", we want to avoid dropping the
incorrect object when we intend to unlock the lock guard. Therefore, a
new set of helper function `drop_r/wlock` is added. This can be removed
in the future when we finish the refactor.
TODOs after this PR: fully remove RemoteLayer, and move LayerMapping to
a separate LayerCache.
all refactor PRs:
```
#4437 --- #4479 ------------ #4510 (refactor done at this point)
\-- #4455 -- #4502 --/
```
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
There is a magic number about how often we repartition and therefore
affecting how often we compact. This PR makes this number `10` a global
constant and add docs.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We want to have a number of custom extensions that will not be available
by default, as an example of such is [Postgres
Anonymizer](https://postgresql-anonymizer.readthedocs.io/en/stable/)
(please note that the extension should be added to
`shared_preload_libraries`). To distinguish them, custom extensions
should be added to a different S3 path:
```
s3://<bucket>/<release version>/<postgres_version>/<ext_name>/share/extensions/
s3://<bucket>/<release version>/<postgres_version>/<ext_name>/lib
where <ext_name> is an extension name
```
Resolves https://github.com/neondatabase/neon/issues/4582
## Summary of changes
- Add Postgres Anonymizer extension to Dockerfile (it's included only to
postgres-extensions target)
- Build extensions image from postgres-extensions target in a workflow
- Upload custom extensions to S3 (different directory)
When we use local SSD for bench and create `.neon` directory before we
do `cargo neon init`, the initialization process will error due to
directory already exists. This PR adds a flag `--force` that removes
everything inside the directory if `.neon` already exists.
---------
Signed-off-by: Alex Chi Z. <chi@neon.tech>
## Problem
See #4516
Inspecting log it is possible to notice that if lwlsn is set to the
beginning of applied WAL record, then
incorrect version of the page is loaded:
```
2023-06-27 18:36:51.930 GMT [3273945] CONTEXT: WAL redo at 0/14AF6F0 for Heap/INSERT: off 2 flags 0x01; blkref #0: rel 1663/5/1259, blk 0 FPW
2023-06-27 18:36:51.930 GMT [3273945] LOG: Do REDO block 0 of rel 1663/5/1259 fork 0 at LSN 0/**014AF6F0**..0/014AFA60
2023-06-27 18:37:02.173 GMT [3273963] LOG: Read blk 0 in rel 1663/5/1259 fork 0 (request LSN 0/**014AF6F0**): lsn=0/**0143C7F8** at character 22
2023-06-27 18:37:47.780 GMT [3273945] LOG: apply WAL record at 0/1BB8F38 xl_tot_len=188, xl_prev=0/1BB8EF8
2023-06-27 18:37:47.780 GMT [3273945] CONTEXT: WAL redo at 0/1BB8F38 for Heap/INPLACE: off 2; blkref #0: rel 1663/5/1259, blk 0
2023-06-27 18:37:47.780 GMT [3273945] LOG: Do REDO block 0 of rel 1663/5/1259 fork 0 at LSN 0/01BB8F38..0/01BB8FF8
2023-06-27 18:37:47.780 GMT [3273945] CONTEXT: WAL redo at 0/1BB8F38 for Heap/INPLACE: off 2; blkref #0: rel 1663/5/1259, blk 0
2023-06-27 18:37:47.780 GMT [3273945] PANIC: invalid lp
```
## Summary of changes
1. Use end record LSN for both cases
2. Update lwlsn for relation metadata
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
We want to store Postgres Extensions in S3 (resolves
https://github.com/neondatabase/neon/issues/4493).
Proposed solution:
- Create a separate docker image (from scratch) that contains only
extensions
- Do not include extensions into compute-node (except for neon
extensions)*
- For release and main builds upload extract extension from the image
and upload to S3 (`s3://<bucket>/<release
version>/<postgres_version>/...`)**
*) We're not doing it until the feature is not fully implemented
**) This differs from the initial proposal in
https://github.com/neondatabase/neon/issues/4493 of putting extensions
straight into `s3://<bucket>/<postgres_version>/...`, because we can't
upload directory atomicly. A drawback of this is that we end up with
unnecessary copies of files ~2.1 GB per release (i.e. +2.1 GB for each
commit in main also). We don't really need to update extensions for each
release if there're no relevant changes, but this requires extra work.
## Summary of changes
- Created a separate stage in Dockerfile.compute-node
`postgres-extensions` that contains only extensions
- Added a separate step in a workflow that builds `postgres-extensions`
image (because of a bug in kaniko this step is commented out because it
takes way too long to get built)
- Extract extensions from the image and upload files to S3 for release
and main builds
- Upload extenstions only for staging (for now)
## Problem
Lets keep 500 for unusual stuff that is not considered normal. Came up
during one of the discussions around console logs now seeing this 500's.
## Summary of changes
- Return 409 Conflict instead of 500
- Remove 200 ok status because it is not used anymore
This PR concludes the "async `Layer::get_value_reconstruct_data`"
project.
The problem we're solving is that, before this patch, we'd execute
`Layer::get_value_reconstruct_data` on the tokio executor threads.
This function is IO- and/or CPU-intensive.
The IO is using VirtualFile / std::fs; hence it's blocking.
This results in unfairness towards other tokio tasks, especially under
(disk) load.
Some context can be found at
https://github.com/neondatabase/neon/issues/4154
where I suspect (but can't prove) load spikes of logical size
calculation to
cause heavy eviction skew.
Sadly we don't have tokio runtime/scheduler metrics to quantify the
unfairness.
But generally, we know blocking the executor threads on std::fs IO is
bad.
So, let's have this change and watch out for severe perf regressions in
staging & during rollout.
## Changes
* rename `Layer::get_value_reconstruct_data` to
`Layer::get_value_reconstruct_data_blocking`
* add a new blanket impl'd `Layer::get_value_reconstruct_data`
`async_trait` method that runs `get_value_reconstruct_data_blocking`
inside `spawn_blocking`.
* The `spawn_blocking` requires `'static` lifetime of the captured
variables; hence I had to change the data flow to _move_ the
`ValueReconstructState` into and back out of get_value_reconstruct_data
instead of passing a reference. It's a small struct, so I don't expect a
big performance penalty.
## Performance
Fundamentally, the code changes cause the following performance-relevant
changes:
* Latency & allocations: each `get_value_reconstruct_data` call now
makes a short-lived allocation because `async_trait` is just sugar for
boxed futures under the hood
* Latency: `spawn_blocking` adds some latency because it needs to move
the work to a thread pool
* using `spawn_blocking` plus the existing synchronous code inside is
probably more efficient better than switching all the synchronous code
to tokio::fs because _each_ tokio::fs call does `spawn_blocking` under
the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the
async executor thread pool. Hence, as long as the disks can keep up,
which they should according to AWS specs, we will be able to deliver
higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get
more throughput. Not a problem because the disks in prod are currently
under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.
Slightly higher latency under regular load is acceptable given the
throughput gains and expected better fairness during disk load peaks,
such as logical size calculation peaks uncovered in #4154.
## Full Stack Of Preliminary PRs
This PR builds on top of the following preliminary PRs
1. Clean-ups
* https://github.com/neondatabase/neon/pull/4316
* https://github.com/neondatabase/neon/pull/4317
* https://github.com/neondatabase/neon/pull/4318
* https://github.com/neondatabase/neon/pull/4319
* https://github.com/neondatabase/neon/pull/4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan where we would need to convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485
7. https://github.com/neondatabase/neon/pull/4441
The stats for `compact_level0_phase` that I added in #4527 show the
following breakdown (24h data from prod, only looking at compactions
with > 1 L1 produced):
* 10%ish of wall-clock time spent between the two read locks
* I learned that the `DeltaLayer::iter()` and `DeltaLayer::key_iter()`
calls actually do IO, even before we call `.next()`. I suspect that is
why they take so much time between the locks.
* 80+% of wall-clock time spent writing layer files
* Lock acquisition time is irrelevant (low double-digit microseconds at
most)
* The generation of the holes holds the read lock for a relatively long
time and it's proportional to the amount of keys / IO required to
iterate over them (max: 110ms in prod; staging (nightly benchmarks):
multiple seconds).
Find below screenshots from my ad-hoc spreadsheet + some graphs.
<img width="1182" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/81398b3f-6fa1-40dd-9887-46a4715d9194">
<img width="901" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/e4ac0393-f2c1-4187-a5e5-39a8b0c394c9">
<img width="210" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/7977ade7-6aa5-4773-a0a2-f9729aecee0d">
## Changes In This PR
This PR makes the following changes:
* rearrange the `compact_level0_phase1` code such that we build the
`all_keys_iter` and `all_values_iter` later than before
* only grab the `Timeline::layers` lock once, and hold it until we've
computed the holes
* run compact_level0_phase1 in spawn_blocking, pre-grabbing the
`Timeline::layers` lock in the async code and passing it in as an
`OwnedRwLockReadGuard`.
* the code inside spawn_blocking drops this guard after computing the
holds
* the `OwnedRwLockReadGuard` requires the `Timeline::layers` to be
wrapped in an `Arc`. I think that's Ok, the locking for the RwLock is
more heavy-weight than an additional pointer indirection.
## Alternatives Considered
The naive alternative is to throw the entire function into
`spawn_blocking`, and use `blocking_read` for `Timeline::layers` access.
What I've done in this PR is better because, with this alternative,
1. while we `blocking_read()`, we'd waste one slot in the spawn_blocking
pool
2. there's deadlock risk because the spawn_blocking pool is a finite
resource

## Metadata
Fixes https://github.com/neondatabase/neon/issues/4492
## Problem
Currently, if a user creates a role, it won't by default have any grants
applied to it. If the compute restarts, the grants get applied. This
gives a very strange UX of being able to drop roles/not have any access
to anything at first, and then once something triggers a config
application, suddenly grants are applied. This removes these grants.
This is follow-up to
```
commit 2252c5c282
Author: Alex Chi Z <iskyzh@gmail.com>
Date: Wed Jun 14 17:12:34 2023 -0400
metrics: convert some metrics to pageserver-level (#4490)
```
The consumption metrics synthetic size worker does logical size
calculation. Logical size calculation currently does synchronous disk
IO. This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving
other futures.
While there's work on the way to move the synchronous disk IO into
spawn_blocking, the quickfix here is to use the BACKGROUND_RUNTIME
instead of MGMT_REQUEST_RUNTIME.
Actually it's not just a quickfix. We simply shouldn't be blocking
MGMT_REQUEST_RUNTIME executor threads on CPU or sync disk IO.
That work isn't done yet, as many of the mgmt tasks still _do_ disk IO.
But it's not as intensive as the logical size calculations that we're
fixing here.
While we're at it, fix disk-usage-based eviction in a similar way. It
wasn't the culprit here, according to prod logs, but it can
theoretically be a little CPU-intensive.
More context, including graphs from Prod:
https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949
Doc says that it should be added into `shared_preload_libraries`, but,
practically, it's not required.
```
postgres=# create extension pg_uuidv7;
CREATE EXTENSION
postgres=# SELECT uuid_generate_v7();
uuid_generate_v7
--------------------------------------
0188e823-3f8f-796c-a92c-833b0b2d1746
(1 row)
```
The histogram distinguishes by ok/err.
I took the liberty to create a small abstraction for such use cases.
It helps keep the label values inside `metrics.rs`, right next
to the place where the metric and its labels are declared.
## Problem
A git tag for a release has an extra `release-` prefix (it looks like
`release-release-3439`).
## Summary of changes
- Do not add `release-` prefix when create git tag
## Problem
In the test environment vacuum duration fluctuates from ~1h to ~5h, along
with another two 1h benchmarks (`select-only` and `simple-update`) it
could be up to 7h which is longer than 6h timeout.
## Summary of changes
- Increase timeout for pgbench-compare job to 8h
- Remove 6h timeouts from Nightly Benchmarks (this is a default value)
* `compaction_threshold` should be an integer, not a string.
* uncomment `[section]` so that if a user needs to modify the config,
they can simply uncomment the corresponding line. Otherwise it's easy
for us to forget uncommenting the `[section]` when uncommenting the
config item we want to configure.
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Commit
```
commit 472cc17b7a
Author: Dmitry Rodionov <dmitry@neon.tech>
Date: Thu Jun 15 17:30:12 2023 +0300
propagate lock guard to background deletion task (#4495)
```
did a drive-by fix, but, the drive-by had a typo.
```
gc_loop{tenant_id=2e2f2bff091b258ac22a4c4dd39bd25d}:update_gc_info{timline_id=837c688fd37c903639b9aa0a6dd3f1f1}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000024DA0D1-000000000443FB51}:panic{thread=background op worker location=pageserver/src/tenant/timeline.rs:4843:25}: missing extractors: ["TimelineId"]
Stack backtrace:
0: utils::logging::tracing_panic_hook
at /libs/utils/src/logging.rs:166:21
1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/alloc/src/boxed.rs:2002:9
2: std::panicking::rust_panic_with_hook
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:692:13
3: std::panicking::begin_panic_handler::{{closure}}
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:579:13
4: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/sys_common/backtrace.rs:137:18
5: rust_begin_unwind
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
6: core::panicking::panic_fmt
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
7: pageserver::tenant::timeline::debug_assert_current_span_has_tenant_and_timeline_id
at /pageserver/src/tenant/timeline.rs:4843:25
8: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:4368:9
9: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
10: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}
at /pageserver/src/tenant/timeline.rs:4363:5
11: <pageserver::tenant::timeline::Timeline>::get_reconstruct_data::{closure#0}
at /pageserver/src/tenant/timeline.rs:2618:69
12: <pageserver::tenant::timeline::Timeline>::get::{closure#0}
at /pageserver/src/tenant/timeline.rs:565:13
13: <pageserver::tenant::timeline::Timeline>::list_slru_segments::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:427:42
14: <pageserver::tenant::timeline::Timeline>::is_latest_commit_timestamp_ge_than::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:390:13
15: <pageserver::tenant::timeline::Timeline>::find_lsn_for_timestamp::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:338:17
16: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:3967:71
17: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
18: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}
at /pageserver/src/tenant/timeline.rs:3948:5
19: <pageserver::tenant::Tenant>::refresh_gc_info_internal::{closure#0}
at /pageserver/src/tenant.rs:2687:21
20: <pageserver::tenant::Tenant>::gc_iteration_internal::{closure#0}
at /pageserver/src/tenant.rs:2551:13
21: <pageserver::tenant::Tenant>::gc_iteration::{closure#0}
at /pageserver/src/tenant.rs:1490:13
22: pageserver::tenant::tasks::gc_loop::{closure#0}::{closure#0}
at /pageserver/src/tenant/tasks.rs:187:21
23: pageserver::tenant::tasks::gc_loop::{closure#0}
at /pageserver/src/tenant/tasks.rs:208:5
```
## Problem
## Summary of changes
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
## Problem
During timeline creation we create special mark file which presense
indicates that initialization didnt complete successfully. In case of a
crash restart we can remove such half-initialized timeline and following
retry from control plane side should perform another attempt.
So in case of a possible crash restart during initial loading we have
following picture:
```
timelines
| - <timeline_id>___uninit
| - <timeline_id>
| - | <timeline files>
```
We call `std::fs::read_dir` to walk files in `timelines` directory one
by one. If we see uninit file
we proceed with deletion of both, timeline directory and uninit file. If
we see timeline we check if uninit file exists and do the same cleanup.
But in fact its possible to get both branches to be true at the same
time. Result of readdir doesnt reflect following directory state
modifications. So you can still get "valid" entry on the next iteration
of the loop despite the fact that it was deleted in one of the previous
iterations of the loop.
To see that you can apply the following patch (it disables uninit mark
cleanup on successful timeline creation):
```diff
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 4beb2664..b3cdad8f 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> {
)
})?;
}
- uninit_mark.remove_uninit_mark().with_context(|| {
- format!(
- "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
- )
- })?;
v.insert(Arc::clone(&new_timeline));
new_timeline.maybe_spawn_flush_loop();
```
And perform the following steps:
```bash
neon_local init
neon_local start
neon_local tenant create
neon_local stop
neon_local start
```
The error is:
```log
INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata
Caused by:
0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
1: No such file or directory (os error 2)
```
So uninit mark got deleted together with timeline directory but we still
got directory entry for it and tried to load it.
The bug prevented tenant from being successfully loaded.
## Summary of changes
Ideally I think we shouldnt place uninit marks in the same directory as timeline directories but move them to separate directory and
gather them as an input to actual listing, but that would be sort of an
on-disk format change, so just check whether entries are still valid
before operating on them.