## Problem
Some metrics are better observed at the pageserver level. Otherwise,
since we have a lot of tenants in production, we cannot compute a sum,
because Prometheus has a limit on how many time series it can
aggregate. This also helps reduce the metrics scrape size.
## Summary of changes
Some integration tests are likely to fail because they check for the
existence of specific metrics; I'm waiting for CI to complete and will fix them.
Metrics downgraded: page cache hit (we are likely to have a
pageserver-level page cache in the future instead of a per-tenant one), and
reconstruct time (this would be better as a tenant-level metric, since we
have one pg replayer per tenant, but for now we make it pageserver-level
because we do not need data that fine-grained).
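As a rough illustration of what "downgrading" means here (metric names and the `prometheus`-crate usage are placeholders, not the actual pageserver code), it replaces a per-tenant labeled vector with a single pageserver-wide series:
```
use lazy_static::lazy_static;
use prometheus::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};

lazy_static! {
    // Before: one time series per tenant; sum() over thousands of
    // tenants runs into Prometheus aggregation limits.
    static ref PAGE_CACHE_HIT_PER_TENANT: IntCounterVec = register_int_counter_vec!(
        "pageserver_page_cache_hit",
        "Page cache hits, per tenant",
        &["tenant_id"]
    )
    .unwrap();

    // After: a single pageserver-level time series.
    static ref PAGE_CACHE_HIT: IntCounter = register_int_counter!(
        "pageserver_page_cache_hit_total",
        "Page cache hits across all tenants on this pageserver"
    )
    .unwrap();
}
```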
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).
# Full Stack Of Preliminary PRs
Thanks to the countless preliminary PRs, this conversion is relatively
straightforward.
1. Clean-ups
* https://github.com/neondatabase/neon/pull/4316
* https://github.com/neondatabase/neon/pull/4317
* https://github.com/neondatabase/neon/pull/4318
* https://github.com/neondatabase/neon/pull/4319
* https://github.com/neondatabase/neon/pull/4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan, where we would convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485
# Significant Changes In This PR
## `compact_level0_phase1` & `create_delta_layer`
This commit partially reverts
"pgserver: spawn_blocking in compaction (#4265)"
4e359db4c7.
Specifically, it reverts the `spawn_blocking`-ification of
`compact_level0_phase1`.
If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()`
inside `compact_level0_phase1`. That would use up a thread in the
`spawn_blocking` thread pool, which is hard-capped.
I considered wrapping the code that follows the second
`layers.read().await` into `spawn_blocking`, but there are lifetime
issues with `deltas_to_compact`.
Also, this PR switches the `create_delta_layer` _function_ back to
async, and uses `spawn_blocking` inside to run the code that does sync
IO, while keeping the code that needs to lock `Timeline::layers` async.
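A minimal sketch of that split, with placeholder types standing in for the real layer machinery:
```
use std::path::PathBuf;
use std::sync::Arc;
use tokio::sync::RwLock;

struct DeltaLayer(PathBuf);

struct Timeline {
    layers: RwLock<Vec<DeltaLayer>>,
}

async fn create_delta_layer(timeline: Arc<Timeline>, path: PathBuf) -> anyhow::Result<()> {
    // Sync filesystem work (write + fsync) runs on the blocking pool...
    let layer = tokio::task::spawn_blocking(move || {
        std::fs::write(&path, b"...")?;
        Ok::<_, std::io::Error>(DeltaLayer(path))
    })
    .await??; // JoinError first, then the IO error

    // ...while locking Timeline::layers stays on the async path.
    timeline.layers.write().await.push(layer);
    Ok(())
}
```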
## `LayerIter` and `LayerKeyIter` `Send` bounds
I had to add a `Send` bound on the `dyn` type that `LayerIter`
and `LayerKeyIter` wrap. Why? Because we now have the second
`layers.read().await` inside `compact_level0_phase1`, and these
iterator instances are held across that await-point.
More background:
https://github.com/neondatabase/neon/pull/4462#issuecomment-1587376960
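In sketch form (the Item type is a stand-in for the real Key type):
```
// Without `+ Send`, the compact_level0_phase1 future could not be Send
// while holding these iterators across `layers.read().await`.
pub struct LayerIter<'a>(Box<dyn Iterator<Item = u64> + Send + 'a>);
pub struct LayerKeyIter<'a>(Box<dyn Iterator<Item = u64> + Send + 'a>);
```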
## `DatadirModification::flush`
Needed to replace the `HashMap::retain` with a hand-rolled variant
because `TimelineWriter::put` is now async.
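Sketched roughly (the predicate and types are illustrative), the hand-rolled variant looks like:
```
use std::collections::HashMap;

struct TimelineWriter;

impl TimelineWriter {
    async fn put(&mut self, _key: u64, _value: &[u8]) -> anyhow::Result<()> {
        Ok(())
    }
}

// HashMap::retain takes a sync closure, so it cannot call the now-async
// put(). Drain the map and rebuild the retained part by hand instead.
async fn flush(
    pending: &mut HashMap<u64, Vec<u8>>,
    writer: &mut TimelineWriter,
) -> anyhow::Result<()> {
    let mut retained = HashMap::new();
    for (key, value) in pending.drain() {
        if key % 2 == 0 {
            // placeholder predicate for "flush this entry"
            writer.put(key, &value).await?;
        } else {
            retained.insert(key, value);
        }
    }
    *pending = retained;
    Ok(())
}
```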
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
Or more specifically, #4441, where we turn Timeline::layers into a
tokio::sync::RwLock.
By using try_write() here, we can avoid turning init_empty_layer_map
async,
which is nice because much of its transitive call(er) graph isn't async.
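A sketch of the idea (field names and types are illustrative):
```
use tokio::sync::RwLock;

struct LayerMap {
    next_open_layer_at: Option<u64>, // stand-in for an Lsn
}

struct Timeline {
    layers: RwLock<LayerMap>,
}

impl Timeline {
    // Called from sync code while the Timeline is still being built, so
    // nothing else can hold the lock yet and try_write() cannot fail.
    fn init_empty_layer_map(&self, start_lsn: u64) {
        let mut layers = self
            .layers
            .try_write()
            .expect("no concurrent access during initialization");
        layers.next_open_layer_at = Some(start_lsn);
    }
}
```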
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
There, we want to switch `Timeline::layers` to be a
`tokio::sync::RwLock`.
That will require the `TimelineWriter` to become async, because at times
its functions need to lock `Timeline::layers` in order to freeze the
open layer.
While doing that, rustc complains that we're now holding
`Timeline::write_lock` across await points (lock order is that
`write_lock` must be acquired before `Timelines::layers`).
So, we need to switch it over to an async primitive.
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
There, we want to switch `Timeline::layers` to be a
`tokio::sync::RwLock`.
That will require the `TimelineWriter` to become async.
That will require `freeze_inmem_layer` to become async.
So, inside check_checkpoint_distance, we will have
`freeze_inmem_layer().await`.
But current rustc isn't smart enough to understand that we
`drop(layers)` earlier, and hence it complains about the `!Send`
`layers` guard being held across the `freeze_inmem_layer().await` point.
This patch puts the guard into a scope, so rustc will shut up in the
next patch, where we make the transition for `TimelineWriter`.
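In sketch form (types simplified), the scoping looks like:
```
use std::sync::RwLock;

async fn check_checkpoint_distance(
    layers: &RwLock<Vec<u8>>,
    checkpoint_distance: usize,
) {
    // Confine the guard to a block so rustc can see that the !Send
    // RwLockReadGuard is not held across the await below.
    let needs_freeze = {
        let guard = layers.read().unwrap();
        guard.len() >= checkpoint_distance
    }; // guard dropped here

    if needs_freeze {
        freeze_inmem_layer().await;
    }
}

async fn freeze_inmem_layer() {
    // stand-in for the real, soon-to-be-async freeze
}
```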
obsoletes https://github.com/neondatabase/neon/pull/4474
Commit `create_test_timeline: always put@initdb_lsn the minimum required keys`
already switched us over to using valid initdb_lsns.
All that's left to do is to actually flush the minimum keys so that
we move from disk_consistent_lsn=Lsn(0) to disk_consistent_lsn=initdb_lsn.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Part of https://github.com/neondatabase/neon/pull/4364
Clarify who's responsible for initializing the layer map. There were
previously two different ways to do it:
- create_empty_timeline and bootstrap_timeline let prepare_timeline()
initialize an empty layer map.
- branch_timeline passed a flag to initialize_with_lock() to tell
initialize_with_lock to call load_layer_map(). Because it was a
newly created timeline, load_layer_map() never found any layer
files, so it just initialized an empty layer map.
With this commit, prepare_new_timeline() always does it. The LSN to
initialize it with is passed as argument.
Other changes per function:
prepare_timeline:
- rename to 'prepare_new_timeline' to make it clear that it's only used
when creating a new timeline, not when loading an existing timeline
- always initialize an empty layer map. The caller can pass the LSN to
initialize it with. (Previously, prepare_timeline would optionally
load the layer map at 'initdb_lsn'. Some callers used that, while others
let initialize_with_lock do it.)
initialize_with_lock:
- As mentioned above, remove the option to load the layer map
- Acquire the 'timelines' lock in the function itself. None of the callers
did any other work while holding the lock.
- Rename it to finish_creation() to make its intent more clear. It's only
used when creating a new timeline now.
create_timeline_data:
- Rename to create_timeline_struct() for clarity. It just initializes
the Timeline struct, not any other "data"
create_timeline_files:
- use create_dir rather than create_dir_all, to be a little more strict.
We know that the parent directory should already exist, and the timeline
directory should not exist.
- Move the call to create_timeline_struct() to the caller. It was just
being "passed through".
Part of https://github.com/neondatabase/neon/pull/4364
This patch inlines `initialize_with_lock` and then reorganizes the code
such that we can `load_layer_map` without holding the
`Tenant::timelines` lock.
As a nice aside, we can get rid of the dummy() uninit mark, which has
always been a terrible hack.
Part of https://github.com/neondatabase/neon/pull/4364
## Problem
Part of https://github.com/neondatabase/neon/issues/4418
## Summary of changes
This PR implements the local manifest interfaces. After the timeline
refactor is done, we can integrate this with the current storage. The
reader stops at the first corrupted record.
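As a rough sketch of the "stop at the first corrupted record" behavior (the actual on-disk format may differ; this assumes length-prefixed records with a crc32):
```
// Each record: u32 length | u32 crc32(payload) | payload.
// On a checksum mismatch or truncated tail, stop and return what we
// have; later records are ignored.
fn read_records(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    while buf.len() >= 8 {
        let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
        let crc = u32::from_le_bytes(buf[4..8].try_into().unwrap());
        if buf.len() < 8 + len {
            break; // truncated record
        }
        let payload = &buf[8..8 + len];
        if crc32fast::hash(payload) != crc {
            break; // first corrupted record
        }
        records.push(payload.to_vec());
        buf = &buf[8 + len..];
    }
    records
}
```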
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Delete data from s3 when timeline deletion is requested
## Summary of changes
UploadQueue is altered to support scheduling delete operations in the
stopped state. This looks weird, and I'm considering whether there are
better options/refactorings for the upload client to make it cleaner.
That can probably be part of https://github.com/neondatabase/neon/issues/4378
Deletion is implemented directly in the existing endpoint because the
changes are not that significant. If we want more safety, we can separate
those or add a feature flag for the new behavior.
resolves [#4193](https://github.com/neondatabase/neon/issues/4193)
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/4392
## Summary of changes
This PR adds a new HashMap that maps persistent layer descs to layer
objects *inside* LayerMap. Originally I went straight for adding such
a layer cache in Timeline, but the changes were too many and could not be
reviewed as a reasonably sized PR. Therefore, we take this intermediate
step: change part of the codebase to use persistent layer descs, and
follow up with other PRs that move this hash map of layer descs to the
timeline struct.
Also, file_size is now part of the layer desc.
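Schematically (types simplified; the real descriptor carries more fields):
```
use std::collections::HashMap;
use std::ops::Range;
use std::sync::Arc;

type Key = u64; // stand-ins for the real Key/Lsn types
type Lsn = u64;

// A cheap, cloneable description of a persistent layer. file_size now
// lives here rather than only on the layer object.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PersistentLayerDesc {
    key_range: Range<Key>,
    lsn_range: Range<Lsn>,
    file_size: u64,
}

trait PersistentLayer {}

// LayerMap now also owns the desc -> layer-object mapping; later PRs
// move this mapping up into the Timeline struct.
struct LayerMap {
    mapping: HashMap<PersistentLayerDesc, Arc<dyn PersistentLayer>>,
}
```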
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
## Problem
Attach failures are not reported in the public part of the API (in the
`attachment_status` field of TenantInfo).
## Summary of changes
Expose TenantState::Broken as TenantAttachmentStatus::Failed.
As written, the Failed status will be reported even if no attachment
happened (i.e., if the tenant becomes broken on startup). This is
in line with the other members; e.g., Active resolves to Attached even
if no actual attach took place.
This can be tweaked if needed; at the current stage that would be overengineering without clear motivation.
resolves #4344
Initial logical size calculation could still hinder our fast startup
efforts in #4397. See #4183. In the deployment of 2023-06-06,
about 200 initial logical sizes were calculated on the hosts that
took the longest to complete initial load (12s).
Implements the three step/tier initialization ordering described in
#4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks
Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size
calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after
timeout or completion of initial logical size calculation
The timeout is there just as a safeguard, in case a legitimate, non-broken
timeline's initial logical size calculation takes long. The timeout is
configurable, 10s by default, which I think would be fine for production
systems. In the test cases I've been looking at, these steps complete
as fast as possible.
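A minimal sketch of the gating mechanism, assuming something in the spirit of `utils::completion::{Completion, Barrier}` built on a watch channel (details differ in the real code):
```
use tokio::sync::watch;

// Holding the Completion keeps waiters blocked; dropping it (e.g. on
// timeout, or when initial sizes are done) releases every Barrier clone.
struct Completion(watch::Sender<()>);

#[derive(Clone)]
struct Barrier(watch::Receiver<()>);

fn channel() -> (Completion, Barrier) {
    let (tx, rx) = watch::channel(());
    (Completion(tx), Barrier(rx))
}

impl Barrier {
    async fn wait(mut self) {
        // changed() returns Err once the Completion (sender) is dropped
        while self.0.changed().await.is_ok() {}
    }
}

async fn background_task(barrier: Barrier) {
    barrier.wait().await; // step 3 starts only after steps 1-2 release us
    // ... compaction / gc / consumption metrics ...
}
```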
Co-authored-by: Christian Schwarz <christian@neon.tech>
We have two ways of tenant shutdown; we should have just one.
The changes are mostly mechanical, simple refactorings.
The added `warn!` on "shutdown all remaining tasks" should trigger test
failures in the interim, until the "tenant/timeline owns all spawned
tasks" issue is solved.
Cc: #4327.
walreceiver logs are a bit hard to understand because of partial span
usage, extra messages, and ignored errors popping up as huge stacktraces.
Fixes #3330 (by spans; also demote info -> debug).
- arrange walreceiver spans into a hierarchy:
- `wal_connection_manager{tenant_id, timeline_id}` ->
`connection{node_id}` -> `poller`
- unifies the error reporting inside `wal_receiver`:
- All ok errors are now `walreceiver connection handling ended: {e:#}`
- All unknown errors are still stacktraceful task_mgr reported errors
with context `walreceiver connection handling failure`
- Remove `connect` special casing, was: `DB connection stream finished`
for ok errors
- Remove `done replicating` special casing, was `Replication stream
finished` for ok errors
- lowered log levels for (non-exhaustive list):
- `WAL receiver manager started, connecting to broker` (at startup)
- `WAL receiver shutdown requested, shutting down` (at shutdown)
- `Connection manager loop ended, shutting down` (at shutdown)
- `sender is dropped while join handle is still alive` (at lucky
shutdown, see #2885)
- `timeline entered terminal state {:?}, stopping wal connection manager
loop` (at shutdown)
- `connected!` (at startup)
- `Walreceiver db connection closed` (at disconnects?, was without span)
- `Connection cancelled` (at shutdown, was without span)
- `observed timeline state change, new state is {new_state:?}` (never
after Timeline::activate was made infallible)
- changed:
- `Timeline dropped state updates sender, stopping wal connection
manager loop`
- was out of date; sender is not dropped but `Broken | Stopping` state
transition
- also made `debug!`
- `Timeline dropped state updates sender before becoming active,
stopping wal connection manager loop`
- was out of date: sender is again not dropped but `Broken | Stopping`
state transition
- also made `debug!`
- log fixes:
- stop double reporting panics via JoinError
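Roughly, the span hierarchy corresponds to nesting tracing spans like this (a sketch; actual field types and call sites differ):
```
use tracing::{info_span, Instrument};

async fn connection_manager_loop(tenant_id: String, timeline_id: String) {
    async {
        let node_id = 1u64; // the safekeeper we connect to
        handle_walreceiver_connection()
            .instrument(info_span!("connection", node_id))
            .await;
    }
    .instrument(info_span!("wal_connection_manager", %tenant_id, %timeline_id))
    .await;
}

async fn handle_walreceiver_connection() {
    // the replication poller gets its own child span
    async { /* poll the replication stream */ }
        .instrument(info_span!("poller"))
        .await;
}
```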
As seen in the deployment of the 2023-06-01 release, times were improving,
but there were some outliers caused by:
- timelines' `eviction_task` starting while activating and running
imitation
- timelines' `initial logical size` calculation
This PR delays `eviction_task` like the other background tasks,
fixing an oversight from the earlier #4372.
After this PR activation will be two phases:
1. load and activate tenants AND calculate some initial logical sizes
2. rest of initial logical sizes AND background tasks
- compaction, gc, disk usage based eviction, timelines `eviction_task`,
consumption metrics
We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.
This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
of the CancellationToken currently, but now they could.
The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function might
run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explicitly marked as such, but this is a
nice extra reminder.
Spawning a task for each request is OK from a performance point of view
because spawning is very cheap in Tokio, and none of our HTTP requests
are very performance critical anyway.
Fixes issue #3478
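In sketch form (handler and types simplified; the real code wraps hyper requests), the pattern is:
```
use tokio_util::sync::CancellationToken;

async fn serve_request(body: String) -> String {
    let token = CancellationToken::new();
    let handler_token = token.child_token();

    // The real work runs in its own task, so even if hyper drops this
    // future, the handler keeps running and is never async-cancelled
    // halfway through.
    let task = tokio::spawn(handler(body, handler_token));

    // If this future *is* dropped, the guard cancels the token, letting
    // a cooperative handler notice and wind down.
    let _cancel_on_drop = token.drop_guard();

    task.await.expect("handler task panicked")
}

async fn handler(body: String, cancel: CancellationToken) -> String {
    // long-running work; may check cancel.is_cancelled() at safe points
    let _ = cancel;
    body
}
```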
## Problem
Part of https://github.com/neondatabase/neon/issues/4373
## Summary of changes
This PR adds `PersistentLayerDesc`, which will be used in the LayerMap
mapping and probably the layer cache. After this PR, and after we change
LayerMap to map to layer descs, we can safely drop RemoteLayerDesc.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
## Problem
This PR includes doc changes to the current metrics, as well as new
metrics. With the new set of metrics, we can quantitatively analyze
the read amp., write amp., and space amp. in the system, when used
together with https://github.com/neondatabase/neonbench
close https://github.com/neondatabase/neon/issues/4312
ref https://github.com/neondatabase/neon/issues/4347
Compaction metrics are TBD; one idea is to report the number of L0 files
and the number of layers in the system, which we can do in the future
when we start working on compaction.
## Summary of changes
* Add `READ_NUM_FS_LAYERS` for computing read amp.
* Add `MATERIALIZED_PAGE_CACHE_HIT_UPON_REQUEST`.
* Add `GET_RECONSTRUCT_DATA_TIME`. GET_RECONSTRUCT_DATA_TIME +
RECONSTRUCT_TIME + WAIT_LSN_TIME should approximately equal the total
time of reads.
* Add `5.0` and `10.0` to `STORAGE_IO_TIME_BUCKETS`, given that some
fsync runs are slow (i.e., > 1s) in some cases.
* Some `WAL_REDO` metrics are only used when Postgres is involved in the
redo process.
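For the bucket change, illustratively (metric name, help text, and the exact bucket list are placeholders):
```
use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};

lazy_static! {
    // Extending the buckets so slow fsyncs (> 1s) still land in a
    // bounded bucket instead of disappearing into +Inf.
    static ref STORAGE_IO_TIME: Histogram = register_histogram!(
        "pageserver_storage_io_time_seconds",
        "Time spent in storage IO operations",
        vec![0.001, 0.01, 0.1, 1.0, 5.0, 10.0]
    )
    .unwrap();
}
```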
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
This parameter can be used to restrict the number of image layers generated
because of GC requests (wanted image layers).
Set to zero, it completely eliminates creation of such image layers,
which avoids extra storage consumption after merging #3673.
## Problem
PR #3673 forces generation of missed image layers, so in the short term it
can increase storage size (in the worst case up to two times).
I had intended the GC period to be comparable with the PiTR interval.
But it looks like that is not the case now - GC is performed much more
frequently. This can cause space exhaustion: GC forces
new image creation while a large PiTR window still prevents GC from
collecting old layers.
## Summary of changes
Add a new pageserver parameter `forced_image_creation_limit` which
restricts the number of image layers created at the request of GC.
Startup continues to be slow; this is work toward alleviating it.
Summary of changes:
- extract the functional improvements from #4366 into
`utils::completion::{Completion, Barrier}`
- extend "initial load completion" usage up to tenant background tasks
- previously only global background tasks
- spawn_blocking the tenant load directory traversal
- demote some logging
- remove some unwraps
- propagate some spans to `spawn_blocking`
Runtime effects should be a major speedup to loading, but after that, the
`BACKGROUND_RUNTIME` will be blocked for a long time (minutes). Possible
follow-ups:
- complete initial tenant sizes before allowing background tasks to
block the `BACKGROUND_RUNTIME`
Startup can take a long time. We suspect it's the initial logical size
calculations. The long-term solution is to not block the tokio executors
but do most of the I/O in spawn_blocking.
See: #4025, #4183
Short-term solutions to the above:
- Delay global background tasks until initial tenant loads complete
- Just limit how many initial logical size calculations we can have at
the same time to `cores / 2`
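A sketch of the concurrency cap (names are illustrative):
```
use once_cell::sync::Lazy;
use tokio::sync::Semaphore;

// Allow at most cores / 2 initial logical size calculations at once.
static CONCURRENT_INITIAL_SIZE_CALCS: Lazy<Semaphore> =
    Lazy::new(|| Semaphore::new(std::cmp::max(1, num_cpus::get() / 2)));

async fn initial_logical_size_calculation() {
    let _permit = CONCURRENT_INITIAL_SIZE_CALCS
        .acquire()
        .await
        .expect("semaphore is never closed");
    // ... walk the timeline and sum up relation sizes ...
}
```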
This PR is for trying in staging.
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
The motivation is to avoid locking `Tenant::timelines` in places that
can't be `async`, because in #4333 we want to convert Tenant::timelines
from `std::sync::Mutex` to `tokio::sync::Mutex`.
But, the changes here are useful in general because they clean up &
document tenant state transitions.
That also paves the way for #4350, which is an alternative to #4333 that
refactors the pageserver code so that we can keep the
`Tenant::timelines` mutex sync.
This patch consists of the following core insights and changes:
* spawn_load and spawn_attach own the tenant state until they're done
* once load()/attach() calls are done ...
* if they failed, transition them to Broken directly (we know that
there's no background activity because we didn't call activate yet)
* if they succeed, call activate. We can make it infallible. How? Later.
* set_broken() and set_stopping() are changed to wait for spawn_load() /
spawn_attach() to finish.
* This sounds scary because it might hinder detach or shutdown, but
actually, concurrent attach+detach, attach+shutdown, and load+shutdown
were just racy before this PR.
So, with this change, they're not anymore.
In the future, we can add a `CancellationToken` stored in Tenant to
cancel `load` and `attach` faster, i.e., make `spawn_load` /
`spawn_attach` transition them to Broken state sooner.
See the doc comments on TenantState for the state transitions that are
now possible.
It might seem scary, but actually, this patch reduces the possible state
transitions.
We introduce a new state `TenantState::Activating` to avoid grabbing the
`Tenant::timelines` lock inside the `send_modify` closure.
These were the humble beginnings of this PR (see Motivation section),
and I think it's still the right thing to have this `Activating` state,
even if we decide against async `Tenant::timelines` mutex. The reason is
that `send_modify` locks internally, and by moving locking of
Tenant::timelines out of the closure, the internal locking of
`send_modify` becomes a leaf of the lock graph, and so, we eliminate
deadlock risk.
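Schematically, the ordering change looks like this (a sketch with simplified types, not the real tenant code):
```
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::watch;

enum TenantState {
    Activating,
    Active,
}

struct Timeline;

struct Tenant {
    timelines: Mutex<HashMap<u64, Arc<Timeline>>>,
    state: watch::Sender<TenantState>,
}

fn activate(tenant: &Tenant) {
    // Nothing else is locked inside the closure, so the watch channel's
    // internal lock stays a leaf of the lock graph.
    tenant.state.send_modify(|state| *state = TenantState::Activating);

    // The Tenant::timelines lock is taken *outside* send_modify.
    let timelines: Vec<Arc<Timeline>> =
        tenant.timelines.lock().unwrap().values().cloned().collect();
    for _timeline in timelines {
        // ... activate each timeline ...
    }

    tenant.state.send_modify(|state| *state = TenantState::Active);
}
```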
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.
Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request is dropped because the
client disconnected, the timeline deletion stops halfway through the
operation. That needs to be fixed, too, but that's a separate story.
(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)
Previously, you used it like this:
|r| RequestSpan(my_handler).handle(r)
But I don't see the point of the RequestSpan struct. It's just a
wrapper around the handler function. With this commit, the call
becomes:
|r| request_span(r, my_handler)
Which seems a little simpler.
At first I thought that the RequestSpan struct would allow "chaining"
other kinds of decorators like RequestSpan, so that you could do
something like this:
|r| CheckPermissions(RequestSpan(my_handler)).handle(r)
But it doesn't work like that. If each of those structs wrap a handler
*function*, it would actually look like this:
|r| CheckPermissions(|r| RequestSpan(my_handler).handle(r)).handle(r)
This commit doesn't make that kind of chaining any easier, but seems a
little more straightforward anyway.
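A sketch of the free-function shape (request, response, and error types are simplified from the real hyper-based code):
```
use std::future::Future;
use tracing::{info_span, Instrument};

struct Request {
    method: String,
    uri: String,
}
struct Response;

async fn request_span<F, Fut>(request: Request, handler: F) -> Response
where
    F: FnOnce(Request) -> Fut,
    Fut: Future<Output = Response>,
{
    // one span per request, wrapping the whole handler future
    let span = info_span!("request", method = %request.method, uri = %request.uri);
    handler(request).instrument(span).await
}
```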
Compaction is usually a compute-heavy process and might affect other
futures running on the same thread as the compaction. Therefore, we add
`block_in_place` as a temporary solution to avoid blocking other futures
on that thread in the runtime. As we migrate towards a fully-async-style
pageserver, we can revert this change once everything is async and
compaction has moved to a separate runtime.
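Sketch of the wrapper (function names are illustrative):
```
struct Timeline;

async fn compact(timeline: &Timeline) -> anyhow::Result<()> {
    // Hand the current worker thread over to blocking work; the runtime
    // moves other tasks off this thread instead of letting them starve.
    // Requires a multi-threaded runtime.
    tokio::task::block_in_place(|| compact_blocking(timeline))
}

fn compact_blocking(_timeline: &Timeline) -> anyhow::Result<()> {
    // ... CPU-heavy merge of delta layers ...
    Ok(())
}
```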
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
We used to generate the ID if the caller didn't specify it. That's bad
practice, however, because the network is never fully reliable: it's
possible we create a new tenant but the caller never learns about it,
and because it doesn't know the tenant ID, it has no way of retrying or
checking whether the creation succeeded. To discourage that, make the ID
mandatory. The web control plane has not relied on the auto-generation
for a long time.
Timeline::activate() was only fallible because `launch_wal_receiver`
was.
`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.
Turns out these checks can be shifted to the type system by delaying
creation of the `WalReceiver` struct to the point where we activate the
timeline.
The changes in this PR were enabled by my previous refactoring that
funneled the broker_client from pageserver startup to the activate()
call sites.
Patch series:
- #4316
- #4317
- #4318
- #4319
(This is prep work to make `Timeline::activate` infallible.)
This patch removes the global storage_broker client instance from the
pageserver codebase.
Instead, pageserver startup instantiates it and passes it down to the
`Timeline::activate` function, which in turn passes it to the
WalReceiver, which is the entity that actually uses it.
Patch series:
- #4316
- #4317
- #4318
- #4319
Before this patch, it would use the error type `TenantStateError`, which
has many more error variants than can actually happen with
`mgr::get_tenant`.
Along the way, I also introduced `SetNewTenantConfigError`, because it
uses `mgr::get_tenant` and likewise can only fail in far fewer ways than
`TenantStateError` suggests.
The new `page_service.rs`'s `GetActiveTimelineError` and
`GetActiveTenantError` types were necessary to avoid an `Other` variant
on the `GetTenantError`.
This patch is a by-product of reading code that subscribes to
`Tenant::state` changes.
Can't really connect it to any given project.
(Instead of going through mgr every iteration.)
The `wait_for_active_tenant` function's `wait` argument could be removed
because it was only used for the loop that waits for the tenant to show
up in the tenants map. Since we're passing the tenant in, we no longer
need to get it from the tenants map.
NB that there's no guarantee that the tenant object is in the tenants
map at the time the background loop function starts running. But the
tenant mgr guarantees that it will be quite soon. See
`tenant_map_insert` way upwards in the call hierarchy for details.
This is prep work to eliminate `subscribe_for_state_updates` (PR #4299 )
Fixes: #3501
This PR refactors the original page_binutils into a single tool, pagectl;
uses clap derive for better command-line parsing; and adds the dump-kv
tool to extract information from delta files. This helps me better
understand what's inside the pageserver. We can add support for other
file types and more functionality in the future.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
(This is prep work to make `Timeline::activate()` infallible.)
The current possibility for failure in `Timeline::activate()` is the
broker client's presence/absence. It should be an assert, but we're
careful with those. So, I'm planning to pass the broker client into
activate(), thereby eliminating the possibility of its absence.
In the unit tests, we don't have a broker client. So, I thought I'd be
in trouble because the unit tests also called `activate()` before this
PR.
However, closer inspection reveals a long-standing FIXME about this,
which is addressed by this patch.
It turns out that the unit tests don't actually need the background
loops to be running. They just need the state value to be `Active`. So,
for the tests, we just set it to that value but don't spawn the
background loops.
We'll need to revisit this if we ever do more Rust unit tests in the
future. But right now, this refactoring improves the code, so, let's
revisit when we get there.
Patch series:
- #4316
- #4317
- #4318
- #4319
## Describe your changes
Right now the only criterion for image layer generation is the number of
delta layers since the last image layer.
If we have a "stairs" layout of delta layers (see link below), it can
happen that there are a lot of old delta layers which cannot be reclaimed
by GC because they are not fully covered by image layers.
This PR constructs a list of "wanted" image layers in GC (the image
layers needed to be able to remove old layers)
and passes this list to the compaction task, which performs the generation
of image layers.
So now, in addition to the delta-count criterion, we also take into
account the "wishes" of GC.
## Issue ticket number and link
See
https://neondb.slack.com/archives/C033RQ5SPDH/p1676914249982519
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
This patch fixes parsing of the `max_lsn_wal_lag` tenant config item.
We were incorrectly expecting a string before, but the type is a
NonZeroU64.
So, when setting it in the config, the (updated) test case would fail
with
```
E psycopg2.errors.InternalError_: Tenant a1fa9cc383e32ddafb73ff920de5f2e6 will not become active. Current state: Broken due to: Failed to parse config from file '.../repo/tenants/a1fa9cc383e32ddafb73ff920de5f2e6/config' as pageserver config: configure option max_lsn_wal_lag is not a string. Backtrace:
```
So, not even the assertions added are necessary.
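The shape of the fix, sketched with a `toml::Value`-style integer accessor (the actual config parsing code differs):
```
use std::num::NonZeroU64;

// Parse an integer (not a string) and convert it to NonZeroU64.
fn parse_max_lsn_wal_lag(item: &toml::Value) -> anyhow::Result<NonZeroU64> {
    let n = item
        .as_integer()
        .ok_or_else(|| anyhow::anyhow!("max_lsn_wal_lag is not an integer"))?;
    let n = u64::try_from(n)
        .map_err(|_| anyhow::anyhow!("max_lsn_wal_lag must be positive"))?;
    NonZeroU64::new(n).ok_or_else(|| anyhow::anyhow!("max_lsn_wal_lag must be non-zero"))
}
```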
The test coverage for tenant config is rather thin in general.
For example, the `test_tenant_conf.py` test doesn't cover all the
options.
I'll add a new regression test as part of the attach-time tenant conf PR
https://github.com/neondatabase/neon/pull/4255