Commit Graph

3172 Commits

Author SHA1 Message Date
Heikki Linnakangas
e2c3c2eccb Merge remote-tracking branch 'origin/main' into HEAD 2025-07-20 00:58:57 +03:00
HaoyuHuang
8f627ea0ab A few more SC changes (#12649)
## Problem

## Summary of changes
2025-07-17 23:17:01 +00:00
Arpad Müller
6a353c33e3 print more timestamps in find_lsn_for_timestamp (#12641)
Observability of `find_lsn_for_timestamp` is lacking, as is visibility
into how and when we update the gc space and time cutoffs. Log them.
2025-07-17 22:13:21 +00:00
HaoyuHuang
b7fc5a2fe0 A few SC changes (#12615)
## Summary of changes
A bunch of no-op changes.

---------

Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-17 13:14:36 +00:00
Erik Grinaker
f765bd3677 pageserver: improve gRPC cancellation 2025-07-17 12:34:46 +02:00
Erik Grinaker
edcdd6ca9c Merge branch 'main' into communicator-rewrite 2025-07-17 10:59:37 +02:00
Alex Chi Z.
f2828bbe19 fix(pageserver): skip gc-compaction for metadata key ranges (#12618)
## Problem

Part of https://github.com/neondatabase/neon/issues/11318. It is not
entirely safe to run gc-compaction over the metadata key range due to
tombstones and the semantics of image layers (a key missing from an image
layer == the key does not exist). The auto gc-compaction trigger already
skips metadata key ranges (see the `schedule_auto_compaction` call in
`trigger_auto_compaction`). This patch enforces the restriction directly
in `gc_compact_inner`, so that compactions triggered via the HTTP API are
also subject to it.

## Summary of changes

Ensure gc-compaction only runs on rel key ranges.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-16 21:52:18 +00:00
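A minimal sketch of the kind of key-range guard described above; the `Key` alias, the `METADATA_KEY_BEGIN` constant, and `clamp_to_rel_keys` are illustrative assumptions, not the actual pageserver key layout or `gc_compact_inner` code:

```rust
// Hypothetical stand-ins: the real pageserver has its own Key type and key-space layout.
type Key = u128;
const METADATA_KEY_BEGIN: Key = 0x6000_0000_0000_0000_0000_0000_0000_0000;

/// Clamp a requested gc-compaction range so it never extends into the metadata key range.
/// Returns None if the request lies entirely within the metadata range.
fn clamp_to_rel_keys(start: Key, end: Key) -> Option<(Key, Key)> {
    if start >= METADATA_KEY_BEGIN {
        return None; // nothing to compact: the whole range is metadata keys
    }
    Some((start, end.min(METADATA_KEY_BEGIN)))
}

fn main() {
    // A request straddling the boundary is trimmed to its rel-key portion.
    assert_eq!(
        clamp_to_rel_keys(0x10, Key::MAX),
        Some((0x10, METADATA_KEY_BEGIN))
    );
    // A request entirely inside the metadata range is rejected.
    assert_eq!(clamp_to_rel_keys(METADATA_KEY_BEGIN, Key::MAX), None);
}
```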
Aleksandr Sarantsev
1178f6fe7c pageserver: Downgrade log level of 'No broker updates' (#12627)
## Problem

The warning message was seen during deployment, but it's actually OK.

## Summary of changes

- Treat `"No broker updates received for a while ..."` as an info
message.

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-16 15:02:01 +00:00
Arpad Müller
5c934efb29 Don't depend on the postgres_ffi just for one type (#12610)
We don't want to depend on postgres_ffi in an API crate. Without that
dependency, we can compile tools like `storcon_cli` without needing
a full working postgres build. Fixes a regression from #12548 (before
that change, it could be compiled).
2025-07-15 17:28:08 +00:00
Heikki Linnakangas
5c9c3b3317 Misc cosmetic cleanups (#12598)
- Remove a few obsolete "allowed error messages" from tests. The
pageserver doesn't emit those messages anymore.

- Remove misplaced and outdated docstring comment from
`test_tenants.py`. A docstring is supposed to be the first thing in a
function, but we had added some code before it. And it was outdated, as
we haven't supported running without safekeepers for a long time.

- Fix misc typos in comments

- Remove obsolete comment about backwards compatibility with safekeepers
without `TIMELINE_STATUS` API. All safekeepers have it by now.
2025-07-15 14:36:28 +00:00
Erik Grinaker
367d96e25b Merge branch 'main' into communicator-rewrite 2025-07-14 18:47:23 +02:00
Vlad Lazar
f8d3f86f58 pageserver: include records in get page debug handler (#12578)
Include records and image in the debug get page handler.
This endpoint does not update the metrics and does not support tracing.

Note that this now returns individual bytes which need to be encoded
properly for debugging.

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-14 16:37:28 +00:00
HaoyuHuang
f67a8a173e A few SK changes (#12577)
# TLDR 
This PR is a no-op. 

## Problem
When a SK loses a disk, it must recover all WALs from the very
beginning. Catching up to the latest WALs for all the timelines it owns
may take days or weeks.

## Summary of changes
When the SK starts up and finds that it has 0 timelines, it will:
- ask the SC for the timelines it owns, then
- pull those timelines from its peer safekeepers to restore WAL
redundancy right away.

After pulling timelines completes, the SK becomes active and accepts
new WALs.

The current implementation is a prototype. It can be optimized further,
e.g. by pulling timelines in parallel.

---------

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-14 16:37:04 +00:00
Erik Grinaker
eb830fa547 pageserver/client_grpc: use unbounded pools (#12585)
## Problem

The communicator gRPC client currently uses bounded client/stream pools.
This can artificially constrain clients, especially after we remove
pipelining in #12584.

[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.

In the worst case, we'll degenerate to the current libpq state with one
stream per backend, but without the TCP connection overhead. In the
common case we expect significantly lower stream counts due to stream
sharing, driven e.g. by idle backends, LFC hits, read coalescing,
sharding (backends typically only talk to one shard at a time), etc.

Currently, Pageservers rarely serve more than 4000 backend connections,
so we have at least 2 orders of magnitude of headroom.

Touches #11735.
Requires #12584.

## Summary of changes

Remove the pool limits, and restructure the pools.

We still keep a separate bulk pool for GetPage batches of >4 pages (>32
KB), with fewer streams per connection. This reduces TCP-level
congestion and head-of-line blocking for non-bulk requests, and
concentrates larger window sizes on a smaller set of
streams/connections, presumably reducing memory usage. Apart from this,
bulk requests don't have any latency penalty compared to other requests.
2025-07-14 13:22:38 +00:00
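As a sketch of the routing rule described above; the pool types, the `pool_for_batch` helper, and the threshold handling are assumptions rather than the real client structure:

```rust
// Hypothetical pool handles; the real client pools channels/clients/streams.
struct StreamPool {
    name: &'static str,
}

struct GetPageClient {
    normal_pool: StreamPool,
    bulk_pool: StreamPool,
}

impl GetPageClient {
    /// Route GetPage batches of more than 4 pages (>32 KB of page data) to the
    /// dedicated bulk pool so they don't head-of-line block small requests.
    fn pool_for_batch(&self, batch_len: usize) -> &StreamPool {
        if batch_len > 4 {
            &self.bulk_pool
        } else {
            &self.normal_pool
        }
    }
}

fn main() {
    let client = GetPageClient {
        normal_pool: StreamPool { name: "normal" },
        bulk_pool: StreamPool { name: "bulk" },
    };
    assert_eq!(client.pool_for_batch(1).name, "normal");
    assert_eq!(client.pool_for_batch(16).name, "bulk");
}
```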
Erik Grinaker
a203f9829a pageserver: add timeline_id span when freezing layers (#12572)
## Problem

We don't log the timeline ID when rolling ephemeral layers during
housekeeping.

Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179)

## Summary of changes

Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer`
from the housekeeping loop.

We don't instrument the function itself: future callers may not already
have a span that includes the tenant_id, but we also don't want to
duplicate the tenant_id in these spans.
2025-07-14 12:30:28 +00:00
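The call-site span might look roughly like the following sketch using the `tracing` crate (the `Timeline` type, span name, and the tokio/tracing-subscriber scaffolding are illustrative):

```rust
use tracing::{info_span, Instrument};

struct Timeline {
    timeline_id: String,
}

impl Timeline {
    async fn maybe_freeze_ephemeral_layer(&self) {
        // Roll the ephemeral layer here if it has grown too large or too old.
    }
}

/// One housekeeping pass: wrap the call in a span carrying the timeline ID
/// instead of instrumenting maybe_freeze_ephemeral_layer() itself.
async fn housekeeping(timelines: &[Timeline]) {
    for timeline in timelines {
        timeline
            .maybe_freeze_ephemeral_layer()
            .instrument(info_span!("freeze_layers", timeline_id = %timeline.timeline_id))
            .await;
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let timelines = vec![Timeline { timeline_id: "tl-1".into() }];
    housekeeping(&timelines).await;
}
```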
Erik Grinaker
42ab34dc36 pageserver/client_grpc: don't pipeline GetPage requests (#12584)
## Problem

The communicator gRPC client currently attempts to pipeline GetPage
requests from multiple callers onto the same gRPC stream. This has a
number of issues:

* Head-of-line blocking: the request may block on e.g. layer download or
LSN wait, delaying the next request.
* Cancellation: we can't easily cancel in-progress requests (e.g. due to
timeout or backend termination), so it may keep blocking the next
request (even its own retry).
* Complex stream scheduling: picking a stream becomes harder/slower, and
additional Tokio tasks and synchronization are needed for stream
management.

Touches #11735.
Requires #12579.

## Summary of changes

This patch removes pipelining of gRPC stream requests, and instead
prefers to scale out the number of streams to achieve the same
throughput. Stream scheduling has been rewritten, and mostly follows the
same pattern as the client pool with exclusive acquisition by a single
caller.

[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.

This has a number of advantages:

* It (mostly) eliminates head-of-line blocking (except at the TCP
level).
* Cancellation becomes trivial, by closing the stream.
* Stream scheduling becomes significantly simpler and cheaper.
* Individual callers can still use client-side batching for pipelining.
2025-07-14 12:11:33 +00:00
Erik Grinaker
30b877074c pagebench: add CPU profiling support (#12478)
## Problem

The new communicator gRPC client has significantly worse Pagebench
performance than a basic gRPC client. We need to find out why.

## Summary of changes

Add a `pagebench --profile` flag which takes a client CPU profile of the
benchmark and writes a flamegraph to `profile.svg`.
2025-07-14 11:44:53 +00:00
Erik Grinaker
f18cc808f0 pageserver/client_grpc: reap idle channels immediately (#12587)
## Problem

It can take 3x the idle timeout to reap a channel. We have to wait for
the idle timeout to trigger first for the stream, then the client, then
the channel.

Touches #11735.

## Summary of changes

Reap empty channels immediately, and rely indirectly on the
channel/stream timeouts.

This can still lead to 2x the idle timeout for streams (first stream
then client), but that's okay -- if the stream closes abruptly (e.g. due
to timeout or error) we want to keep the client around in the pool for a
while.
2025-07-14 10:47:26 +00:00
Erik Grinaker
d14d8271b8 pageserver/client_grpc: improve retry logic (#12579)
## Problem

gRPC client retries currently include pool acquisition under the
per-attempt timeout. If pool acquisition is slow (e.g. full pool), this
will cause spurious timeout warnings, and the caller will lose its place
in the pool queue.

Touches #11735.

## Summary of changes

Makes several improvements to retries and related logic:

* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, move constants into main module.
* Don't backoff on the first retry, and reduce initial/max backoffs to
5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart
to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
2025-07-14 10:43:10 +00:00
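A rough, library-style sketch of the retry shape described above, assuming tokio/anyhow/tracing; the helper names `with_timeout` and `with_retries` follow the PR description, but their bodies here are illustrative:

```rust
use std::future::Future;
use std::time::Duration;

/// Apply a per-attempt timeout to a request future. Pool acquisition happens
/// before this is called, so a slow pool does not burn the request budget.
async fn with_timeout<F, T>(timeout: Duration, fut: F) -> anyhow::Result<T>
where
    F: Future<Output = anyhow::Result<T>>,
{
    tokio::time::timeout(timeout, fut)
        .await
        .map_err(|_| anyhow::anyhow!("request timed out"))?
}

/// Retry a fallible operation: no backoff before the first retry, then
/// exponential backoff starting at 5ms and capped at 5s.
async fn with_retries<F, Fut, T>(mut op: F) -> anyhow::Result<T>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = anyhow::Result<T>>,
{
    let mut backoff = Duration::ZERO;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(err) => {
                tracing::warn!("request failed, retrying in {:?}: {:#}", backoff, err);
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2)
                    .max(Duration::from_millis(5))
                    .min(Duration::from_secs(5));
            }
        }
    }
}
```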
Erik Grinaker
fecb707b19 pagebench: add idle-streams (#12583)
## Problem

For the communicator scheduling policy, we need to understand the
server-side cost of idle gRPC streams.

Touches #11735.

## Summary of changes

Add an `idle-streams` benchmark to `pagebench` which opens a large
number of idle gRPC GetPage streams.
2025-07-14 09:41:58 +00:00
Erik Grinaker
87f01a25ab pageserver/client_grpc: reap idle channels immediately 2025-07-13 18:44:05 +02:00
Erik Grinaker
56eb511618 pageserver/client_grpc: use unbounded pools 2025-07-13 13:29:27 +02:00
Erik Grinaker
ddeb3f3ed3 pageserver/client_grpc: don't pipeline GetPage requests 2025-07-13 12:24:17 +02:00
Heikki Linnakangas
69dbad700c Merge remote-tracking branch 'origin/main' into HEAD 2025-07-12 16:43:57 +03:00
Erik Grinaker
0d5f4dd979 pageserver/client_grpc: improve retry logic 2025-07-12 12:41:11 +02:00
HaoyuHuang
cb991fba42 A few more PS changes (#12552)
# TLDR
Problem-I is a bug fix. The rest are no-ops. 

## Problem I
The pageserver decides whether to create image layers based on elapsed
time, but this check depends on the current logical size, which is only
computed on shard 0. Thus, for non-zero shards, the check is ineffective
and image creation is never done for idle tenants.

## Summary of changes I
This PR fixes the problem by simply removing the dependency on current
logical size.

## Summary of changes II
This PR adds a timeout when calling the pageserver to split a shard, to
make sure the SC does not wait on the API call forever. The PR currently
doesn't add any retry logic, because it's not clear whether a pageserver
shard split can be safely retried while the existing operation is still
ongoing or has left the storage in a bad state. It's better to abort the
whole operation and restart.

## Problem III
`test_remote_failures` requires the PS to be compiled in testing mode.
PS builds in dev/staging are compiled without it.

## Summary of changes III
Remove the restriction and also increase the number of total failures
allowed.

## Summary of changes IV
Remove the test of the PS getpage HTTP route.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-11 19:27:55 +00:00
Matthias van de Meent
4566b12a22 NEON: Finish Zenith->Neon rename (#12566)
Even though we're now part of Databricks, let's at least make this part
consistent.

## Summary of changes

- PG14: https://github.com/neondatabase/postgres/pull/669
- PG15: https://github.com/neondatabase/postgres/pull/670
- PG16: https://github.com/neondatabase/postgres/pull/671
- PG17: https://github.com/neondatabase/postgres/pull/672

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-07-11 18:56:39 +00:00
Alex Chi Z.
63ca084696 fix(pageserver): downgrade wal apply error during gc-compaction (#12518)
## Problem

close LKB-162

close https://github.com/neondatabase/cloud/issues/30665, related to
https://github.com/neondatabase/cloud/issues/29434

We see a lot of errors like:

```
2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records

Caused by:
    0: read walredo stdout
    1: early eof
```

which is an acceptable form of error, so we should downgrade it to a
warning.

## Summary of changes

A walredo error during gc-compaction is expected when the data below the
gc horizon does not contain a full key history. This can happen in some
rare cases where gc was only able to remove data in the middle of the
history, but not all of the earlier history, when a full keyspace got
deleted.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-11 18:37:55 +00:00
Vlad Lazar
154f6dc59c pageserver: log only on final shard resolution failure (#12565)
This log is too noisy. Instead of warning on every retry, let's log only
on the final failure.
2025-07-11 13:25:25 +00:00
Vlad Lazar
15f633922a pageserver: use image consistent LSN for force image layer creation (#12547)
This is a no-op for the neon deployment.

* Introduce the concept of the image consistent LSN: the largest LSN
below which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe
HTTP endpoint
* Add a sharded timeline describe endpoint to storcon

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-11 11:39:51 +00:00
Erik Grinaker
8cd5370c00 Merge branch 'main' into communicator-rewrite 2025-07-11 10:39:26 +02:00
Erik Grinaker
8aa9540a05 pageserver/page_api: include block number and rel in gRPC GetPageResponse (#12542)
## Problem

With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places of the stack
(concurrent reads, shard splits, LFC hits, etc). If we included the
block numbers with the pages in `GetPageResponse` we could have better
verification and observability that the final responses are correct.

Touches #11735.
Requires #12480.

## Summary of changes

Add a `Page` struct with `block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
2025-07-10 22:35:14 +00:00
Erik Grinaker
44ea17b7b2 pageserver/page_api: add attempt to GetPage request ID (#12536)
## Problem

`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.

Touches #11735.
Requires #12480.

## Summary of changes

Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
2025-07-10 20:39:42 +00:00
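One way to picture the ID-plus-attempt scheme; the struct and field names are assumptions, not the actual protobuf definition:

```rust
/// A GetPage request ID that stays unique across retries: `id` identifies the
/// logical request, `attempt` distinguishes each retry of it.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct RequestId {
    id: u64,
    attempt: u32,
}

impl RequestId {
    fn next_attempt(self) -> Self {
        Self { attempt: self.attempt + 1, ..self }
    }
}

fn main() {
    let first = RequestId { id: 42, attempt: 0 };
    let retry = first.next_attempt();
    // The retried request no longer collides with the original one...
    assert_ne!(first, retry);
    // ...but still refers to the same logical request.
    assert_eq!(first.id, retry.id);
}
```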
Erik Grinaker
dcdfe80bf0 pagebench: add support for rich gRPC client (#12477)
## Problem

We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.

Touches #11735.
Requires #12476.

## Summary of changes

Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also adds a compression parameter to
the client.
2025-07-10 17:30:09 +00:00
Erik Grinaker
2fc77c836b pageserver/client_grpc: add shard map updates (#12480)
## Problem

The communicator gRPC client must support changing the shard map on
splits.

Touches #11735.
Requires #12476.

## Summary of changes

* Wrap the shard set in an `ArcSwap` to allow swapping it out.
* Add a new `ShardSpec` parameter struct to pass validated shard info to
the client.
* Add `update_shards()` to change the shard set. In-flight requests are
allowed to complete using the old shards.
* Restructure `get_page` to use a stable view of the shard map, and
retry errors at the top (pre-split) level to pick up shard map changes.
* Also marks `tonic::Status::Internal` as non-retryable, so that we can
use it for client-side invariant checks without continually retrying
these.
2025-07-10 15:46:39 +00:00
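A minimal sketch of the `ArcSwap`-based swap described above, with a made-up `ShardSpec` and client; the real types and request path differ:

```rust
use arc_swap::ArcSwap;
use std::sync::Arc;

/// Hypothetical shard map: shard index -> pageserver endpoint.
struct ShardSpec {
    endpoints: Vec<String>,
}

struct PageserverClient {
    shards: ArcSwap<ShardSpec>,
}

impl PageserverClient {
    fn new(spec: ShardSpec) -> Self {
        Self { shards: ArcSwap::from_pointee(spec) }
    }

    /// Swap in a new shard map, e.g. after a shard split. In-flight requests
    /// keep using whichever Arc they already loaded.
    fn update_shards(&self, spec: ShardSpec) {
        self.shards.store(Arc::new(spec));
    }

    fn get_page(&self, shard_idx: usize) -> Option<String> {
        // Take a stable view of the shard map for the duration of this request.
        let shards = self.shards.load_full();
        shards.endpoints.get(shard_idx).cloned()
    }
}

fn main() {
    let client = PageserverClient::new(ShardSpec { endpoints: vec!["ps-1:51051".into()] });
    client.update_shards(ShardSpec {
        endpoints: vec!["ps-1:51051".into(), "ps-2:51051".into()],
    });
    assert_eq!(client.get_page(1).as_deref(), Some("ps-2:51051"));
}
```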
HaoyuHuang
2c6b327be6 A few PS changes (#12540)
# TLDR
All changes are no-ops except some metrics.

## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors

Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC

Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)    


## Summary of changes II
This PR adds a pageserver endpoint and its counterpart in the storage
controller to list the visible size of all tenant shards. This is a
prerequisite for the tenant rebalance command.


## Problem III
We need a way to download WAL
segments/layerfiles from S3 and replay WAL records. We cannot access
production S3 from our laptops directly, and we also can't transfer any
user data out of production systems for GDPR compliance, so we need
solutions.

## Summary of changes III

This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.

## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful for debugging compaction issues, as index_part
often contains a lot of covered layers due to PITR.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-10 14:39:38 +00:00
Vlad Lazar
ffeede085e libs: move metric collection for pageserver and safekeeper in a background task (#12525)
## Problem

Safekeeper and pageserver metrics collection might time out. We've seen
this in both hadron and neon.

## Summary of changes

This PR moves metrics collection in the PS/SK to a background task so
that we always get some metrics, even if they are somewhat delayed.
Reducing metrics collection time is left to future work.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 11:58:22 +00:00
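The background-collection idea could be sketched like this with tokio; the names, the 60-second cadence, and the string snapshot are all illustrative:

```rust
use std::sync::{Arc, RwLock};
use std::time::Duration;

/// Collect metrics on a fixed cadence in a background task, so the scrape
/// endpoint can always return the latest snapshot without waiting on a
/// potentially slow collection pass.
fn spawn_metrics_collector(snapshot: Arc<RwLock<String>>) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(60));
        loop {
            ticker.tick().await;
            let fresh = collect_metrics().await; // may be slow; doesn't block scrapes
            *snapshot.write().unwrap() = fresh;
        }
    });
}

async fn collect_metrics() -> String {
    // Placeholder for the actual (possibly slow) metrics gathering.
    "pageserver_up 1\n".to_string()
}

#[tokio::main]
async fn main() {
    let snapshot = Arc::new(RwLock::new(String::new()));
    spawn_metrics_collector(snapshot.clone());
    // A scrape just reads whatever the background task produced most recently.
    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("{}", snapshot.read().unwrap());
}
```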
Erik Grinaker
f4b03ddd7b pageserver/client_grpc: reap idle pool resources (#12476)
## Problem

The gRPC client pools don't reap idle resources.

Touches #11735.
Requires #12475.

## Summary of changes

Reap idle pool resources (channels/clients/streams) after 3 minutes of
inactivity.

Also restructure the `StreamPool` to use a mutex rather than atomics for
synchronization, for simplicity. This will be optimized later.
2025-07-10 10:18:37 +00:00
Vlad Lazar
08b19f001c pageserver: optionally force image layer creation on timeout (#12529)
This PR introduces an `image_creation_timeout` for pageservers so that we
can force image creation after a certain period. This is set to 1
day on dev/staging for now, and will roll out to production 1-2 weeks
later.

The majority of the PR is boilerplate code to add the new knob. The
specific changes are:
1. During L0 compaction, check if we should force a compaction when
min(LSN) of all delta layers < the force_image_creation LSN.
2. During image creation, check if we should force a compaction when the
image's LSN < the force_image_creation LSN and there are newer deltas
with overlapping key ranges.
3. Also tweak the image creation check interval to make sure we honor
image_creation_timeout.

Vlad's note: This should be a no-op. I added an extra PS config for the
large timeline
threshold to enable this.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 10:07:21 +00:00
Christian Schwarz
2edd59aefb impr(compaction): unify checking of CompactionError for cancellation reason (#12392)
There are a couple of places that call `CompactionError::is_cancel` but
don't check the `::Other` variant, via downcasting, for a root cause of
cancellation. The only place that does is `log_compaction_error`.
It's sad we have to do this, but until we get around to cleaning up all
the culprits, a step forward is to unify the behavior so that every place
that inspects a `CompactionError` for a cancellation reason behaves the
same way.

Thus, this PR ...
- moves the downcasting checks against the `::Other` variant from
  `log_compaction_error` into `is_cancel()` and
- enforces via the type system that `.is_cancel()` is used to check
  whether a CompactionError is due to cancellation (matching on
  `CompactionError::ShuttingDown` will cause a compile-time error).

I don't think there's a _serious_ case right now where matching instead
of using `is_cancel` causes problems.
The worst case I could find is the circuit breaker and
`compaction_failed`, which don't really matter if we're shutting down
the timeline anyway. But it's unaesthetic and might cause log/alert
noise down the line, so this PR fixes that at least.

Refs
- https://databricks.atlassian.net/browse/LKB-182
- slack conversation about this PR:
https://databricks.slack.com/archives/C09254R641L/p1751284317955159
2025-07-09 21:15:44 +00:00
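A rough shape of the unified check, with made-up error types (`Cancelled`, a two-variant `CompactionError`) standing in for the real ones:

```rust
use anyhow::anyhow;

/// Stand-in for the real cancellation error carried inside `::Other` chains.
#[derive(Debug, thiserror::Error)]
#[error("operation was cancelled")]
struct Cancelled;

enum CompactionError {
    ShuttingDown,
    Other(anyhow::Error),
}

impl CompactionError {
    /// The single place that decides whether an error means "we are shutting down":
    /// the explicit variant, or a cancellation cause buried in an `Other` chain.
    fn is_cancel(&self) -> bool {
        match self {
            CompactionError::ShuttingDown => true,
            CompactionError::Other(err) => err
                .chain()
                .any(|cause| cause.downcast_ref::<Cancelled>().is_some()),
        }
    }
}

fn main() {
    let wrapped = CompactionError::Other(anyhow!(Cancelled).context("flush failed"));
    assert!(wrapped.is_cancel());
    assert!(!CompactionError::Other(anyhow!("disk full")).is_cancel());
}
```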
Vlad Lazar
fe0ddb7169 libs: make remote storage failure injection probabilistic (#12526)
Change the unreliable storage wrapper to fail probabilistically while
there are failure attempts left.

Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
2025-07-09 17:41:34 +00:00
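The probabilistic gate might look something like this sketch using the `rand` crate; the wrapper name, fields, and error type are assumptions:

```rust
use rand::Rng;
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of an "unreliable" wrapper: while injected-failure budget remains,
/// each operation fails with the configured probability instead of always failing.
struct UnreliableStorage {
    failures_left: AtomicU64,
    failure_probability: f64, // e.g. 0.5
}

impl UnreliableStorage {
    fn maybe_fail(&self) -> Result<(), &'static str> {
        if self.failures_left.load(Ordering::Relaxed) > 0
            && rand::thread_rng().gen_bool(self.failure_probability)
        {
            self.failures_left.fetch_sub(1, Ordering::Relaxed);
            return Err("injected failure");
        }
        Ok(())
    }
}

fn main() {
    let storage = UnreliableStorage {
        failures_left: AtomicU64::new(3),
        failure_probability: 0.5,
    };
    let injected = (0..100).filter(|_| storage.maybe_fail().is_err()).count();
    // At most the configured number of failures is ever injected.
    assert!(injected <= 3);
}
```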
Erik Grinaker
2f71eda00f pageserver/client_grpc: add separate pools for bulk requests (#12475)
## Problem

GetPage bulk requests such as prefetches and vacuum can head-of-line
block foreground requests, causing increased latency.

Touches #11735.
Requires #12469.

## Summary of changes

* Use dedicated channel/client/stream pools for bulk GetPage requests.
* Use lower concurrency but higher queue depth for bulk pools.
* Make pool limits configurable.
* Require an unbounded client pool for the stream pool, to avoid
accidental starvation.
2025-07-09 16:12:59 +00:00
Alex Chi Z.
5ec82105cc fix(pageserver): ensure remote size gets computed (#12520)
## Problem

Follow up of #12400 

## Summary of changes

We didn't set remote_size_mb to Some when initialized so it never gets
computed :(

Also added a new API to force refresh the properties.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-09 15:35:19 +00:00
Mikhail
bc6a756f1c ci: lint openapi specs using redocly (#12510)
We need to lint specs for pageserver, endpoint storage, and safekeeper
#0000
2025-07-09 14:29:45 +00:00
Erik Grinaker
8f3351fa91 pageserver/client_grpc: split GetPage batches across shards (#12469)
## Problem

The rich gRPC Pageserver client needs to split GetPage batches that
straddle multiple shards.

Touches #11735.
Requires #12462.

## Summary of changes

Adds a `GetPageSplitter` which splits `GetPageRequest`s that span
multiple shards and then reassembles the responses. Per-shard requests
are dispatched in parallel.
2025-07-09 14:17:22 +00:00
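Conceptually, the splitter groups the requested blocks by owning shard and remembers each block's position so responses can be reassembled in the caller's order; a simplified sketch with a toy `shard_for_block` routing function (the real mapping uses the tenant's shard stripe layout):

```rust
use std::collections::HashMap;

/// Toy routing: which shard owns a given block. The real mapping uses the
/// tenant's shard count and stripe size, not a plain modulo.
fn shard_for_block(block: u32, shard_count: u32) -> u32 {
    block % shard_count
}

/// Split a batch of blocks into per-shard sub-batches, remembering where each
/// block sat in the original request so responses can be reassembled in order.
fn split_batch(blocks: &[u32], shard_count: u32) -> HashMap<u32, Vec<(usize, u32)>> {
    let mut per_shard: HashMap<u32, Vec<(usize, u32)>> = HashMap::new();
    for (pos, &block) in blocks.iter().enumerate() {
        per_shard
            .entry(shard_for_block(block, shard_count))
            .or_default()
            .push((pos, block));
    }
    per_shard
}

fn main() {
    let batch = [0u32, 5, 2, 7];
    let split = split_batch(&batch, 2);
    // Under the toy routing, shard 0 gets the even blocks and shard 1 the odd ones;
    // the saved positions let the reassembled response keep the original order.
    assert_eq!(split[&0], vec![(0, 0), (2, 2)]);
    assert_eq!(split[&1], vec![(1, 5), (3, 7)]);
}
```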
Heikki Linnakangas
8db138ef64 Plumb through the stripe size to the communicator 2025-07-09 16:18:26 +03:00
Erik Grinaker
3915995530 pageserver/client_grpc: add rich Pageserver gRPC client (#12462)
## Problem

For the communicator, we need a rich Pageserver gRPC client.

Touches #11735.
Requires #12434.

## Summary of changes

This patch adds an initial rich Pageserver gRPC client. It supports:

* Sharded tenants across multiple Pageservers.
* Pooling of connections, clients, and streams for efficient resource
use.
* Concurrent use by many callers.
* Internal handling of GetPage bidirectional streams, with pipelining
and error handling.
* Automatic retries.
* Observability.

The client is still under development. In particular, it needs GetPage
batch splitting, shard map updates, and performance optimization. This
will be addressed in follow-up PRs.
2025-07-09 11:42:46 +00:00
Christian Schwarz
aac1f8efb1 refactor(compaction): eliminate CompactionError::CollectKeyspaceError variant (#12517)
The only differentiated handling of it is for `is_critical`, which in
turn is a `matches!()` on several variants of the
`enum CollectKeyspaceError`, i.e. the value contained inside
`CompactionError::CollectKeyspaceError`.

This PR introduces a new error for `repartition()`, allowing its
immediate callers to inspect it like `is_critical` did.

A drive-by fix is more precise classification of WaitLsnError::BadState
when mapping to `tonic::Status`.

refs
- https://databricks.atlassian.net/browse/LKB-182
2025-07-09 08:41:36 +00:00
Erik Grinaker
08399672be Temporary workaround for timeout retry errors 2025-07-09 09:49:15 +02:00
Alex Chi Z.
43dbded8c8 fix(pageserver): disallow lease creation below the applied gc cutoff (#12489)
## Problem

close LKB-209

## Summary of changes

- We should not allow lease creation below the applied gc cutoff.
- Also removed the condition for `AttachedSingle`. We should always
check the lease against the gc cutoff in all attach modes.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-08 22:32:51 +00:00
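The guard described above amounts to comparing the requested LSN against the applied gc cutoff before granting a lease; a toy sketch with `Lsn` as a plain integer:

```rust
/// Toy LSN; the pageserver has a proper Lsn newtype.
type Lsn = u64;

#[derive(Debug, PartialEq)]
enum LeaseError {
    BelowGcCutoff { requested: Lsn, cutoff: Lsn },
}

/// Refuse to create an LSN lease below the applied gc cutoff, regardless of
/// attachment mode: data below the cutoff may already have been garbage collected.
fn validate_lease(requested: Lsn, applied_gc_cutoff: Lsn) -> Result<(), LeaseError> {
    if requested < applied_gc_cutoff {
        return Err(LeaseError::BelowGcCutoff { requested, cutoff: applied_gc_cutoff });
    }
    Ok(())
}

fn main() {
    assert!(validate_lease(0x2000, 0x1000).is_ok());
    assert_eq!(
        validate_lease(0x0800, 0x1000),
        Err(LeaseError::BelowGcCutoff { requested: 0x0800, cutoff: 0x1000 })
    );
}
```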