## Problem
The communicator gRPC client currently attempts to pipeline GetPage
requests from multiple callers onto the same gRPC stream. This has a
number of issues:
* Head-of-line blocking: the request may block on e.g. layer download or
LSN wait, delaying the next request.
* Cancellation: we can't easily cancel in-progress requests (e.g. due to
timeout or backend termination), so it may keep blocking the next
request (even its own retry).
* Complex stream scheduling: picking a stream becomes harder/slower, and
additional Tokio tasks and synchronization are needed for stream
management.
Touches #11735.
Requires #12579.
## Summary of changes
This patch removes pipelining of gRPC stream requests, and instead
prefers to scale out the number of streams to achieve the same
throughput. Stream scheduling has been rewritten, and mostly follows the
same pattern as the client pool with exclusive acquisition by a single
caller.
[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.
This has a number of advantages:
* It (mostly) eliminates head-of-line blocking (except at the TCP
level).
* Cancellation becomes trivial, by closing the stream.
* Stream scheduling becomes significantly simpler and cheaper.
* Individual callers can still use client-side batching for pipelining.
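As a rough illustration of exclusive acquisition (not the actual Rust implementation), a pool can hand each stream to exactly one caller at a time and simply drop the stream when the caller fails or is cancelled. `StreamPool`, `open_stream`, and `max_streams` are hypothetical names used only for this Python sketch:

```python
import asyncio
from contextlib import asynccontextmanager


class StreamPool:
    """Hands each stream to one caller at a time (no pipelining)."""

    def __init__(self, open_stream, max_streams):
        self._open_stream = open_stream      # hypothetical factory for a new stream
        self._idle = []                      # streams not currently held by a caller
        self._slots = asyncio.Semaphore(max_streams)

    @asynccontextmanager
    async def acquire(self):
        await self._slots.acquire()          # wait here if all streams are in use
        stream = self._idle.pop() if self._idle else self._open_stream()
        healthy = False
        try:
            yield stream                     # the caller owns the stream exclusively
            healthy = True
        finally:
            if healthy:
                self._idle.append(stream)    # reuse only streams that finished cleanly
            self._slots.release()            # a failed/cancelled stream is just dropped
```

Because a stream never carries more than one request, cancelling a request amounts to not returning its stream to the idle list.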
## Problem
The new communicator gRPC client has significantly worse Pagebench
performance than a basic gRPC client. We need to find out why.
## Summary of changes
Add a `pagebench --profile` flag which takes a client CPU profile of the
benchmark and writes a flamegraph to `profile.svg`.
## Problem
It can take 3x the idle timeout to reap a channel. We have to wait for
the idle timeout to trigger first for the stream, then the client, then
the channel.
Touches #11735.
## Summary of changes
Reap empty channels immediately, and rely indirectly on the
channel/stream timeouts.
This can still lead to 2x the idle timeout for streams (first stream
then client), but that's okay -- if the stream closes abruptly (e.g. due
to timeout or error) we want to keep the client around in the pool for a
while.
## Problem
gRPC client retries currently include pool acquisition under the
per-attempt timeout. If pool acquisition is slow (e.g. full pool), this
will cause spurious timeout warnings, and the caller will lose its place
in the pool queue.
Touches #11735.
## Summary of changes
Makes several improvements to retries and related logic:
* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, move constants into main module.
* Don't backoff on the first retry, and reduce initial/max backoffs to
5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart
to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
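A minimal Python sketch of the retry shape described above, assuming a hypothetical `retry_request` helper (this is not the Rust `Retry` type): pool acquisition happens outside the per-attempt timeout, the first retry is immediate, and later retries back off from 5 ms up to 5 s. It assumes `max_attempts >= 1`, and the set of retryable errors is an assumption.

```python
import asyncio


async def retry_request(acquire, attempt, *, timeout, max_attempts,
                        initial_backoff=0.005, max_backoff=5.0):
    """Run `attempt(client)` with retries; only the request itself is timed."""
    backoff = 0.0                                  # the first retry happens immediately
    last_error = None
    for _ in range(max_attempts):
        if backoff > 0:
            await asyncio.sleep(backoff)
        client = await acquire()                   # pool wait is NOT under the timeout
        try:
            return await asyncio.wait_for(attempt(client), timeout)
        except (asyncio.TimeoutError, ConnectionError) as err:
            last_error = err                       # assumed retryable error types
        # Backoff sequence: 0 -> 5 ms -> 10 ms -> ... capped at 5 s.
        backoff = initial_backoff if backoff == 0 else min(backoff * 2, max_backoff)
    raise last_error
```

Keeping `acquire()` outside `wait_for` means a full pool delays the request but cannot trigger a spurious request timeout.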
## Problem
For the communicator scheduling policy, we need to understand the
server-side cost of idle gRPC streams.
Touches #11735.
## Summary of changes
Add an `idle-streams` benchmark to `pagebench` which opens a large
number of idle gRPC GetPage streams.
## Problem
When refreshing cancellation data we resend the entire value again just
to reset the TTL, which causes unnecessary load in proxy, on network and
possibly on redis side.
## Summary of changes
* Switch from using SET with full value to using EXPIRE to reset TTL.
* Add a tiny delay between retries to prevent busy loop.
* Shorten CancelKeyOp variants: drop redundant suffix.
* Retry SET when EXPIRE failed.
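The refresh path can be sketched as follows. `FakeRedis` is an in-memory stand-in written for this example (not a real client library), and `refresh_cancel_key` is a hypothetical name; the point is only that the TTL reset does not resend the value unless the key is gone.

```python
import time


class FakeRedis:
    """In-memory stand-in exposing just SET .. EX and EXPIRE semantics."""

    def __init__(self):
        self._data = {}                      # key -> (value, deadline)

    def set(self, key, value, ex):
        self._data[key] = (value, time.monotonic() + ex)

    def expire(self, key, seconds):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            self._data.pop(key, None)
            return False                     # EXPIRE reports that the key is missing
        self._data[key] = (entry[0], time.monotonic() + seconds)
        return True


def refresh_cancel_key(redis, key, value, ttl):
    """Reset the TTL cheaply; resend the full value only if EXPIRE fails."""
    if redis.expire(key, ttl):
        return "expire"
    redis.set(key, value, ex=ttl)
    return "set"
```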
## Problem
When a connection terminates, its maintain_cancel_key task keeps running
until the CANCEL_KEY_REFRESH sleep finishes, and then it triggers another
cancel key TTL refresh before exiting.
## Summary of changes
* Check for cancellation while sleeping and interrupt sleep.
* If cancelled, break the loop, don't send a refresh cmd.
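A small Python sketch of the interruptible sleep, using `threading.Event` in place of the async cancellation token; the function signature here is an assumption for illustration only:

```python
import threading


def maintain_cancel_key(refresh, cancelled, interval):
    """Refresh the key every `interval` seconds until `cancelled` is set."""
    refreshes = 0
    while True:
        # Event.wait returns True as soon as the event is set, which
        # interrupts the sleep instead of waiting out the full interval.
        if cancelled.wait(timeout=interval):
            break                            # exit without sending another refresh
        refresh()
        refreshes += 1
    return refreshes
```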
## Problem
We don't validate the `new_sk_set` before starting the
migration. It is validated later, so the migration to an invalid
safekeeper set will fail anyway. But at that point we might have already
committed an invalid `new_sk_set` to the database, and there is no `abort`
command yet. I ran into this issue in neon_local and ruined the timeline
:)
- Part of https://github.com/neondatabase/neon/issues/11669
## Summary of changes
- Add safekeeper count and safekeeper duplication checks before starting
the migration
- Test that we validate the `new_sk_set` before starting the migration
- Add `force` option to the `TimelineSafekeeperMigrateRequest` to
disable non-mandatory checks
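The pre-migration checks could look roughly like this. The sketch is hedged: the minimum count of 3, and the choice of which check `force` may skip, are assumptions made for illustration rather than the actual storcon rules.

```python
def validate_new_sk_set(new_sk_set, min_count=3, force=False):
    """Return a list of problems with the requested safekeeper set."""
    problems = []
    # Duplicates are treated as always invalid in this sketch.
    if len(set(new_sk_set)) != len(new_sk_set):
        problems.append("duplicate safekeeper ids")
    # The count check is treated as non-mandatory, so `force` skips it (assumption).
    if len(new_sk_set) < min_count and not force:
        problems.append(f"need at least {min_count} safekeepers")
    return problems
```

Running the checks before anything is written means an invalid set is rejected while the database still holds the old configuration.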
Serialize query row responses directly into JSON. Some of this code
should be using the `json::value_as_object/list` macros, but I've
avoided it for now to minimize the size of the diff.
## Problem
To store cancellation data we send two commands to redis because the
redis server version doesn't support HSET with EX. Also, HSET is not
really needed.
## Summary of changes
* Replace the HSET + EXPIRE command pair with one SET .. EX command.
* Replace HGET with GET.
* Leave a workaround for old keys set with HSET.
* Replace some anyhow errors with specific errors to surface the
WRONGTYPE error from redis.
# TLDR
Problem-I is a bug fix. The rest are no-ops.
## Problem I
Page server checks image layer creation based on the elapsed time but
this check depends on the current logical size, which is only computed
on shard 0. Thus, for non-0 shards, the check will be ineffective and
image creation will never be done for idle tenants.
## Summary of changes I
This PR fixes the problem by simply removing the dependency on current
logical size.
## Summary of changes II
This PR adds a timeout when calling the page server to split a shard, to
make sure SC does not wait for the API call forever. Currently the PR
doesn't add any retry logic because it's not clear whether a page server
shard split can be safely retried if the existing operation is still
ongoing or has left the storage in a bad state. Thus it's better to abort
the whole operation and restart.
## Problem III
`test_remote_failures` requires PS to be compiled in the testing mode.
PS binaries in dev/staging are compiled without this mode.
## Summary of changes III
Remove the restriction and also increase the number of total failures
allowed.
## Summary of changes IV
Remove the test on the PS getpage HTTP route.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
## Problem
close LKB-162
close https://github.com/neondatabase/cloud/issues/30665, related to
https://github.com/neondatabase/cloud/issues/29434
We see a lot of errors like:
```
2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records
Caused by:
0: read walredo stdout
1: early eof
```
which is an acceptable form of error and we should downgrade it to
warning.
## Summary of changes
A walredo error during gc-compaction is expected when the data below the
gc horizon does not contain a full key history. This can happen in rare
cases where gc is only able to remove data in the middle of the history,
but not all earlier history, when a full keyspace gets deleted.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Fixes [LKB-61](https://databricks.atlassian.net/browse/LKB-61):
`test_timeline_archival_chaos` being flaky with storcon error `Requested
tenant is missing`.
When a tenant migration is ongoing, and the attach request has been sent
to the new location, but the attach hasn't finished yet, it is possible
for the pageserver to return a 412 precondition failed HTTP error on
timeline deletion, because it is being sent to the new location already.
Previously we would log that with something like:
```
ERROR request{method=DELETE path=/v1/tenant/1f544a11c90d1afd7af9b26e48985a4e/timeline/32818fb3ebf07cb7f06805429d7dee38 request_id=c493c04b-7f33-46d2-8a65-aac8a5516055}: Error processing HTTP request: InternalServerError(Error deleting timeline 32818fb3ebf07cb7f06805429d7dee38 on 1f544a11c90d1afd7af9b26e48985a4e on node 2 (localhost): pageserver API: Precondition failed: Requested tenant is missing
```
This patch changes that and makes us return a more reasonable resource
unavailable error. Not sure how scalable this is with tenants with a
large number of shards, but that's a different discussion (we'd probably
need a limited amount of per-storcon retries).
example
[link](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12398/15981821532/index.html#/testresult/e7785dfb1238d92f).
Update the WSS estimate before acquiring the lock, so that we don't need
to hold the lock for so long. That seems safe to me, see added comment.
I was planning to do this with the new rust-based communicator
implementation anyway, but it might help a little with the current C
implementation too. And more importantly, having this as a separate PR
gives us a chance to review this aspect independently.
## Problem
Cancellation requires redis, and redis requires control-plane.
## Summary of changes
Make redis for cancellation not require control plane.
Add instructions for setting up redis locally.
Split the functions into two: one internal function to calculate the
estimate, and another (two functions) to expose it as SQL functions.
This is in preparation of adding new communicator implementation. With
that, the SQL functions will dispatch the call to the old or new
implementation depending on which is being used.
This is a no-op for the neon deployment
* Introduce the concept of an image consistent LSN: the largest LSN below
which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe
HTTP endpoint
* Add a sharded timeline describe endpoint to storcon
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
## Problem
We have a `safekeeper_migrate` handler, but no subcommand in
`storcon_cli`. Same for `/:timeline_id/locate` for identifying current
set of safekeepers.
- Closes: https://github.com/neondatabase/neon/issues/12395
## Summary of changes
- Add `timeline-safekeeper-migrate` and `timeline-locate` subcommands to
`storcon_cli`
On December 8th, 2023, an engineering escalation (INC-110) was opened
after it was found that BYPASSRLS was being applied to all roles.
PR that introduced the issue:
https://github.com/neondatabase/neon/pull/5657
Subsequent commit on main:
ad99fa5f03
NOBYPASSRLS and INHERIT are the defaults for a Postgres role, but
because it isn't easy to know if a Postgres cluster is affected by the
issue, we need to keep the migration around for a long time, if not
indefinitely, so any cluster can be fixed.
Branching is the gift that keeps on giving...
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
## Problem
With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places of the stack
(concurrent reads, shard splits, LFC hits, etc). If we included the
block numbers with the pages in `GetPageResponse` we could have better
verification and observability that the final responses are correct.
Touches #11735.
Requires #12480.
## Summary of changes
Add a `Page` struct with `block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
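To illustrate the kind of check this enables, here is a hedged Python sketch (the real types are protobuf/Rust). The `verify_pages` helper and the assumption that pages come back in request order are illustrative only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    block_number: int
    image: bytes


def verify_pages(requested_blocks, pages):
    """Check that the response carries exactly the requested blocks, in order."""
    expected = list(requested_blocks)
    got = [page.block_number for page in pages]
    if got != expected:
        raise ValueError(f"requested blocks {expected}, got {got}")
    return pages
```

With block numbers attached to each page, a reassembly bug shows up as a mismatch at the client boundary instead of as silently swapped page images.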
## Problem
Part of LKB-379
The pageserver connstrings are updated in the postmaster and then
there's a hook to propagate it to the shared memory of all backends.
However, the shard stripe size isn't propagated this way. This would
cause problems during shard splits:
* the compute has active reads/writes
* shard split happens and the cplane applies the new config (pageserver
connstring + stripe size)
* pageserver connstring will be updated immediately once the postmaster
receives the SIGHUP, and it will be copied over to the shared memory of
all other backends.
* stripe size is a normal GUC and we don't have special handling around
that, so if any active backend has ongoing txns the value won't be
applied.
* now it's possible for backends to issue requests based on the wrong
stripe size; what's worse, if a request gets cached in the prefetch
buffer, it will get stuck forever.
## Summary of changes
To make sure it aligns with the current default in storcon.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.
Touches #11735.
Requires #12480.
## Summary of changes
Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
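A minimal sketch of the idea in Python, with hypothetical field names (`id`, `attempt`): the pair stays unique across retries while the base ID still ties the attempts together.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RequestId:
    id: int            # stable across retries of the same logical request
    attempt: int = 0   # incremented on every retry

    def next_attempt(self):
        return replace(self, attempt=self.attempt + 1)
```

A server tracking in-flight requests by the full `(id, attempt)` pair can then assert uniqueness without tripping on a retry.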
## Problem
One PG tenant may write too fast and overwhelm the PS. The other tenants
sharing the same PSs will get very little bandwidth.
We ran one experiment with two tenants sharing the same PSs. One tenant
ran a large ingestion that delivered hundreds of MB/s while the other
only got < 10 MB/s.
## Summary of changes
Rate limit how fast PG can generate WALs. The default is -1. We may
scale the default value with the CPU count. Need to run some experiments
to verify.
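One common way to implement such a limit is a token bucket. The sketch below is only an illustration of that general technique, not the mechanism used in this PR; the class name, the one-second burst, and treating values `<= 0` (such as the `-1` default) as "no limit" are assumptions.

```python
class WalRateLimiter:
    """Token bucket: tells a writer how long to sleep before writing more WAL."""

    def __init__(self, bytes_per_sec, now=0.0):
        self.rate = bytes_per_sec                    # <= 0 means "no limit" (assumption)
        self.tokens = float(max(bytes_per_sec, 0))   # allow a burst of one second
        self.last = now

    def delay_for(self, nbytes, now):
        if self.rate <= 0:
            return 0.0
        # Refill for the elapsed time, capped at one second worth of budget.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes                        # may go negative: that is the debt
        return max(0.0, -self.tokens / self.rate)
```

For example, at 1 MB/s a writer that has used up its burst and then writes another 500 KB is told to wait 0.5 s.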
## How is this tested?
CI.
PGBench. No limit first. Then set to 1 MB/s and you can see the tps
drop. Then reverted the change and tps increased again.
```
pgbench -i -s 10 -p 55432 -h 127.0.0.1 -U cloud_admin -d postgres
pgbench postgres -c 10 -j 10 -T 6000000 -P 1 -b tpcb-like -h 127.0.0.1 -U cloud_admin -p 55432
```
```
progress: 33.0 s, 986.0 tps, lat 10.142 ms stddev 3.856
progress: 34.0 s, 973.0 tps, lat 10.299 ms stddev 3.857
progress: 35.0 s, 1004.0 tps, lat 9.939 ms stddev 3.604
progress: 36.0 s, 984.0 tps, lat 10.183 ms stddev 3.713
progress: 37.0 s, 998.0 tps, lat 10.004 ms stddev 3.668
progress: 38.0 s, 648.9 tps, lat 12.947 ms stddev 24.970
progress: 39.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 40.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 41.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 42.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 43.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 44.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 45.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 46.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 47.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 48.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 49.0 s, 347.3 tps, lat 321.560 ms stddev 1805.633
progress: 50.0 s, 346.8 tps, lat 9.898 ms stddev 3.809
progress: 51.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 52.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 53.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 54.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 55.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 56.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 57.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 58.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 59.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 60.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 61.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 62.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 63.0 s, 494.5 tps, lat 276.504 ms stddev 1853.689
progress: 64.0 s, 488.0 tps, lat 20.530 ms stddev 71.981
progress: 65.0 s, 407.8 tps, lat 9.502 ms stddev 3.329
progress: 66.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 67.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 68.0 s, 504.5 tps, lat 71.627 ms stddev 397.733
progress: 69.0 s, 371.0 tps, lat 24.898 ms stddev 29.007
progress: 70.0 s, 541.0 tps, lat 19.684 ms stddev 24.094
progress: 71.0 s, 342.0 tps, lat 29.542 ms stddev 54.935
```
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
After https://github.com/neondatabase/neon/pull/12240 we observed
issues in our Go code: `ComputeStatus` is not stateless, so it doesn't
deserialize as a string.
```
could not check compute activity: json: cannot unmarshal object into Go struct field
ComputeState.status of type computeclient.ComputeStatus
```
- Fix this by splitting this status into two.
- Update the compute OpenAPI spec to reflect changes to `/terminate` in
the previous PR
## Problem
If we have catalog update AND a pageserver migration batched in a single
spec, we will not be able to apply the spec (running the SQL) because
the compute is not attached to the right pageserver and we are not able
to read anything if we don't pick up the latest pageserver connstring.
This is not an issue for now because cplane always schedules shard split /
pageserver migrations with `skip_pg_catalog_updates` (I suppose).
Context:
https://databricks.slack.com/archives/C09254R641L/p1752163559259399?thread_ts=1752160163.141149&cid=C09254R641L
With this fix, backpressure will likely not be able to affect
reconfigurations.
## Summary of changes
Do `pg_reload_conf` before we apply specs in SQL.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.
Touches #11735.
Requires #12476.
## Summary of changes
Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also adds a compression parameter to
the client.
## Problem
Sometimes we run out of free ports in `PortDistributor`. This
particularly affects failed tests that we rerun automatically up to 3
times (which makes them use up to 3x more ports).
## Summary of changes
- Cycle over the range of ports to reuse freed ports from previous tests
Ref: LKB-62
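The cycling behavior can be sketched as below. This is a simplified stand-in for the test fixture, not its actual code: the `release` method and the in-memory `_in_use` set are assumptions, since a real distributor would more likely probe whether a port can be bound.

```python
class PortDistributor:
    """Hands out ports round-robin, wrapping around to reuse freed ones."""

    def __init__(self, base_port, port_count):
        self._ports = range(base_port, base_port + port_count)
        self._next = 0
        self._in_use = set()

    def get_port(self):
        for _ in range(len(self._ports)):
            port = self._ports[self._next]
            # Wrap around instead of stopping at the end of the range.
            self._next = (self._next + 1) % len(self._ports)
            if port not in self._in_use:
                self._in_use.add(port)
                return port
        raise RuntimeError("no free ports in range")

    def release(self, port):
        self._in_use.discard(port)
```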
## Problem
The communicator gRPC client must support changing the shard map on
splits.
Touches #11735.
Requires #12476.
## Summary of changes
* Wrap the shard set in an `ArcSwap` to allow swapping it out.
* Add a new `ShardSpec` parameter struct to pass validated shard info to
the client.
* Add `update_shards()` to change the shard set. In-flight requests are
allowed to complete using the old shards.
* Restructure `get_page` to use a stable view of the shard map, and
retry errors at the top (pre-split) level to pick up shard map changes.
* Also marks `tonic::Status::Internal` as non-retryable, so that we can
use it for client-side invariant checks without continually retrying
these.
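The snapshot-and-retry pattern can be illustrated in Python, where replacing a reference to an immutable tuple plays the role of `ArcSwap`. The `key % len(shards)` routing, the `fetch` callback, and the use of `LookupError` for "wrong shard" are assumptions made for this sketch; real routing uses the shard stripe layout.

```python
class ShardedClient:
    def __init__(self, shards):
        self._shards = tuple(shards)         # immutable snapshot, replaced wholesale

    def update_shards(self, shards):
        self._shards = tuple(shards)         # one reference assignment is the swap

    def get_page(self, key, fetch, max_attempts=3):
        for _ in range(max_attempts):
            shards = self._shards            # stable view for this whole attempt
            shard = shards[key % len(shards)]
            try:
                return fetch(shard, key)
            except LookupError:
                continue                     # retry from the top with a fresh snapshot
        raise LookupError(f"no shard served key {key} after {max_attempts} attempts")
```

Because each attempt reads the shard set once, a concurrent `update_shards()` never changes the map under an in-flight attempt; the new map is only picked up by the next retry.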
# TLDR
All changes are no-op except some metrics.
## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors
Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC
Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)
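The categorization above amounts to a small mapping from error kind to outcome label. A sketch in Python, where the error-kind strings and label names are hypothetical and not the actual metric label values:

```python
from collections import Counter

# Hypothetical error-kind names for the three internal error classes.
INTERNAL_ERRORS = {"page_reconstruction", "lsn_timeout", "misrouted_request"}


def classify_result(error_kind):
    """Map a request outcome to one of the three counter labels."""
    if error_kind is None:
        return "success"
    if error_kind in INTERNAL_ERRORS:
        return "internal_error"
    return "other_error"


def count_results(error_kinds):
    return Counter(classify_result(kind) for kind in error_kinds)
```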
## Summary of changes II
This PR adds a pageserver endpoint and its counterpart in storage
controller to list visible size of all tenant shards. This will be a
prerequisite of the tenant rebalance command.
## Problem III
We need a way to download WAL
segments/layerfiles from S3 and replay WAL records. We cannot access
production S3 from our laptops directly, and we also can't transfer any
user data out of production systems for GDPR compliance, so we need
solutions.
## Summary of changes III
This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.
## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful for debugging compaction issues, as index_part
often contains a lot of covered layers due to PITR.
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
## Problem
close LKB-253
## Summary of changes
404 for timeline requests could happen when the tenant is intended to be
on a pageserver but not attached yet. This patch adds handling for the
lease request. In the future, we should extend this handling to more
operations.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Safekeeper and pageserver metrics collection might time out. We've seen
this in both hadron and neon.
## Summary of changes
This PR moves metrics collection in PS/SK to the background so that we
will always get some metrics, even though they may be somewhat delayed.
Reducing metrics collection time is left to future work.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
## Problem
The gRPC client pools don't reap idle resources.
Touches #11735.
Requires #12475.
## Summary of changes
Reap idle pool resources (channels/clients/streams) after 3 minutes of
inactivity.
Also restructure the `StreamPool` to use a mutex rather than atomics for
synchronization, for simplicity. This will be optimized later.
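The reaping rule itself is simple. The following Python sketch uses a hypothetical `IdleTracker` with an injected clock value; only the 3-minute timeout comes from the description above.

```python
class IdleTracker:
    """Tracks last-use times and reaps resources idle for too long."""

    def __init__(self, idle_timeout=180.0):   # 3 minutes
        self.idle_timeout = idle_timeout
        self._last_used = {}                  # resource -> time of last use

    def touch(self, resource, now):
        self._last_used[resource] = now

    def reap(self, now):
        expired = [r for r, t in self._last_used.items()
                   if now - t >= self.idle_timeout]
        for resource in expired:
            del self._last_used[resource]     # the caller would close it here
        return expired
```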
This PR introduces an `image_creation_timeout` to page servers so that we
can force image creation after a certain period. This is set to 1
day on dev/staging for now, and will roll out to production 1-2 weeks
later.
The majority of the PR is boilerplate code to add the new knob. Specific
changes of the PR are:
1. During L0 compaction, check if we should force a compaction if
min(LSN) of all delta layers < force_image_creation LSN.
2. During image creation, check if we should force a compaction if the
image's LSN < force_image_creation LSN and there are newer deltas with
overlapping key ranges.
3. Also tweaked the check image creation interval to make sure we honor
image_creation_timeout.
Vlad's note: This should be a no-op. I added an extra PS config for the
large timeline threshold to enable this.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
When a function is owned by a superuser (bootstrap user or otherwise),
we consider it safe to run it. Only a superuser could have installed it,
typically from CREATE EXTENSION script: we trust the code to execute.
## Problem
This is intended to solve running pg_graphql Event Triggers
graphql_watch_ddl and graphql_watch_drop which are executing the secdef
function graphql.increment_schema_version().
## Summary of changes
Allow executing Event Trigger function owned by a superuser and with
SECURITY DEFINER properties. The Event Trigger code runs with superuser
privileges, and we consider that it's fine.
---------
Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
There are a couple of places that call `CompactionError::is_cancel` but
don't check the `::Other` variant via downcasting for root cause being
cancellation.
The only place that does it is `log_compaction_error`.
It's sad we have to do it, but, until we get around to cleaning up all the
culprits, a step forward is to unify the behavior so that all places that
inspect a `CompactionError` for cancellation reason follow the same
behavior.
Thus, this PR ...
- moves the downcasting checks against the `::Other` variant from
`log_compaction_error` into `is_cancel()` and
- enforces via type system that `.is_cancel()` is used to check whether
a CompactionError is due to cancellation (matching on the
`CompactionError::ShuttingDown` will cause a compile-time error).
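The first point, a single `is_cancel()` that also inspects the root cause of the catch-all variant, can be sketched in Python with exception chaining standing in for Rust downcasting. The `kind` strings and the `Cancelled` class are hypothetical, and the compile-time enforcement has no Python equivalent, so it is not shown.

```python
class Cancelled(Exception):
    """Stand-in for a root-cause error that means 'operation was cancelled'."""


class CompactionError(Exception):
    def __init__(self, kind, message=""):
        super().__init__(message)
        self.kind = kind                     # e.g. "shutting_down" or "other"

    def is_cancel(self):
        if self.kind == "shutting_down":
            return True
        # For the catch-all kind, walk the cause chain looking for cancellation.
        cause = self.__cause__
        while cause is not None:
            if isinstance(cause, Cancelled):
                return True
            cause = cause.__cause__
        return False
```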
I don't think there's a _serious_ case right now where matching instead
of using `is_cancel` causes problems.
The worst case I could find is the circuit breaker and
`compaction_failed`,
which don't really matter if we're shutting down the timeline anyway.
But it's unaesthetic and might cause log/alert noise down the line,
so, this PR fixes that at least.
Refs
- https://databricks.atlassian.net/browse/LKB-182
- slack conversation about this PR:
https://databricks.slack.com/archives/C09254R641L/p1751284317955159
## Problem
close LKB-199
## Summary of changes
We always return the error as 500 to the cplane if an LSN lease request
fails. This causes issues for the cplane as they don't retry on 500. This
patch correctly passes through the error and assigns the error code so
that cplane can know if it is a retryable error. (TODO: look at the
cplane code and learn the retry logic).
Note that this patch does not resolve LKB-253 -- we need to handle not
found error separately in the lsn lease path, like wait until the tenant
gets attached, or return 503 so that cplane can retry.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Change the unreliable storage wrapper to fail by probability when there
are more failure attempts left.
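A hedged Python sketch of that behavior, with hypothetical names (`UnreliableWrapper`, `fail_probability`, `call`): while failure attempts remain, each call fails with the given probability rather than unconditionally.

```python
import random


class UnreliableWrapper:
    """Injects failures with some probability while a failure budget remains."""

    def __init__(self, max_failures, fail_probability, rng=None):
        self.failures_left = max_failures
        self.fail_probability = fail_probability
        self.rng = rng or random.Random()

    def call(self, op, *args):
        # random() is in [0.0, 1.0), so probability 1.0 always fails
        # and probability 0.0 never does.
        if self.failures_left > 0 and self.rng.random() < self.fail_probability:
            self.failures_left -= 1
            raise IOError("simulated failure")
        return op(*args)
```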
Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
## Problem
Test `test_branch_creation_before_gc` is flaky in the internal repo.
Pageserver sometimes lags behind write LSN. When we call GC it might not
reach the LSN we try to create the branch at yet.
## Summary of changes
- Wait till flush lsn on pageserver reaches the latest LSN before
calling GC.
## Problem
GetPage bulk requests such as prefetches and vacuum can head-of-line
block foreground requests, causing increased latency.
Touches #11735.
Requires #12469.
## Summary of changes
* Use dedicated channel/client/stream pools for bulk GetPage requests.
* Use lower concurrency but higher queue depth for bulk pools.
* Make pool limits configurable.
* Require unbounded client pool for stream pool, to avoid accidental
starvation.
## Problem
Follow up of #12400
## Summary of changes
We didn't set remote_size_mb to Some when initialized so it never gets
computed :(
Also added a new API to force refresh the properties.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Due to a lag in replication, we sometimes cannot get the parent branch
definition just after completion of the Public API restore call. This
leads to the test failures.
https://databricks.atlassian.net/browse/LKB-279
## Summary of changes
A workaround is implemented: the test now retries for up to 30 seconds,
waiting for the branch definition to appear.
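The retry loop can be sketched as a generic polling helper. `wait_until`, its `probe` callback, and the 0.5 s polling interval are assumptions for illustration; only the 30-second limit comes from the description.

```python
import time


def wait_until(probe, timeout=30.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` until it returns a non-None value or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(interval)
```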
---------
Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>