18 Commits

Author SHA1 Message Date
Erik Grinaker
7cd0066212 page_api: add SplitError for GetPageSplitter (#12709)
Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>`
implementation. This avoids a bunch of boilerplate to convert
`GetPageSplitter` errors into `tonic::Status`.

Requires #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
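As a rough sketch of the pattern (the variant names here are hypothetical; only the error-to-`tonic::Status` conversion is the point):

```rust
use tonic::Status;

/// Hypothetical sketch of a splitter error with a `tonic::Status` conversion;
/// the real variants in `pageserver_page_api` may differ.
#[derive(Debug, thiserror::Error)]
pub enum SplitError {
    #[error("invalid GetPage request: {0}")]
    InvalidRequest(String),
    #[error("response for unknown shard: {0}")]
    UnknownShard(String),
}

impl From<SplitError> for Status {
    fn from(err: SplitError) -> Status {
        match err {
            SplitError::InvalidRequest(msg) => Status::invalid_argument(msg),
            SplitError::UnknownShard(msg) => Status::internal(msg),
        }
    }
}
```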
2025-07-29 18:26:20 +00:00
Erik Grinaker
65d1be6e90 pageserver: route gRPC requests to child shards (#12702)
## Problem

During shard splits, each parent shard is split and removed
incrementally. Only when all parent shards have split is the split
committed and the compute notified. This can take several minutes for
large tenants. In the meantime, the compute will keep sending requests to
the (now-removed) parent shards.

This was (mostly) not a problem for the libpq protocol, because it does
shard routing on the server side. The compute just sends a request to
some Pageserver, and the server figures out which local shard should
serve it.

It is a problem for the gRPC protocol, where the client explicitly says
which shard it's talking to.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
Requires #12772.

## Summary of changes

* Add server-side routing of gRPC requests to any local child shards if
the parent does not exist.
* Add server-side splitting of GetPage batch requests straddling
multiple child shards.
* Move the `GetPageSplitter` into `pageserver_page_api`.

I really don't like this approach, but it avoids making changes to the
split protocol. I could be convinced we should change the split protocol
instead, e.g. to keep the parent shard alive until the split commits and
the compute has been notified, but we can also do that as a later change
without blocking the communicator on it.
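Conceptually, the routing fallback looks something like this (simplified, with hypothetical stand-ins for the pageserver's shard types and key-to-shard mapping):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the pageserver's shard identifiers and key type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ShardIndex { number: u8, count: u8 }
struct Key(u128);
struct Shard;
struct Tenant { shards: HashMap<ShardIndex, Shard> }

impl Tenant {
    /// Resolve the shard that should serve `key`. If the shard the client
    /// addressed no longer exists (a parent removed mid-split), fall back to
    /// whichever local child shard now owns the key.
    fn resolve(&self, requested: ShardIndex, key: &Key) -> Option<&Shard> {
        if let Some(shard) = self.shards.get(&requested) {
            return Some(shard);
        }
        self.shards
            .iter()
            .filter(|(idx, _)| idx.count > requested.count) // child: higher shard count
            .find(|(idx, _)| owns_key(**idx, key))
            .map(|(_, shard)| shard)
    }
}

/// Placeholder for the real stripe-based key-to-shard mapping.
fn owns_key(_shard: ShardIndex, _key: &Key) -> bool { true }
```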
2025-07-29 16:28:57 +00:00
Erik Grinaker
5a48365fb9 pageserver/client_grpc: don't set stripe size for unsharded tenants (#12639)
## Problem

We've had bugs where the compute would use the stale default stripe size
from an unsharded tenant after the tenant was split with a new stripe size.

## Summary of changes

Never specify a stripe size for unsharded tenants, to guard against
misuse. Only specify it once tenants are sharded and the stripe size
can't change.

Also opportunistically changes `GetPageSplitter` to return
`anyhow::Result`, since we'll be using this in other code paths as well
(specifically during server-side shard splits).
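The guard amounts to making a stripe size impossible to pass for unsharded tenants, roughly along these lines (a hypothetical, simplified sketch):

```rust
/// Hypothetical sketch of the guard: a stripe size may only be supplied once
/// the tenant is sharded and the value can no longer change; unsharded
/// tenants must pass `None`, so a stale default can never leak into routing.
fn validate_stripe_size(shard_count: u8, stripe_size: Option<u32>) -> anyhow::Result<()> {
    // Simplification: treat a shard count of 0 or 1 as unsharded.
    anyhow::ensure!(
        (shard_count > 1) == stripe_size.is_some(),
        "stripe size must be set if and only if the tenant is sharded \
         (shard_count={shard_count}, stripe_size={stripe_size:?})"
    );
    Ok(())
}
```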
2025-07-21 12:28:39 +00:00
Erik Grinaker
194b9ffc41 pageserver: remove gRPC CheckRelExists (#12616)
## Problem

Postgres will often immediately follow a relation existence check with a
relation size query. This incurs two roundtrips, and may prevent
effective caching.

See [Slack
thread](https://databricks.slack.com/archives/C091SDX74SC/p1751951732136139).

Touches #11728.

## Summary of changes

For the gRPC API:

* Add an `allow_missing` parameter to `GetRelSize`; when set, a missing
relation returns `missing=true` instead of a `NotFound` error (sketched
below).
* Remove `CheckRelExists`.

There are no changes to libpq behavior.
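In request/response terms, the change is roughly the following (a sketch of the shapes only, not the actual proto definitions; `RelTag` here is a stand-in):

```rust
// Sketch of the request/response shapes; the real gRPC messages are protobuf
// definitions in the page_api crate and may name fields differently.
struct RelTag { spcnode: u32, dbnode: u32, relnode: u32, forknum: u8 } // stand-in

struct GetRelSizeRequest {
    rel: RelTag,
    /// When true, a missing relation yields `missing = true` instead of a
    /// NotFound error, so one roundtrip answers both "does it exist?" and
    /// "how big is it?".
    allow_missing: bool,
}

struct GetRelSizeResponse {
    num_blocks: u32, // relation size in blocks; 0 when missing
    missing: bool,
}
```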
2025-07-21 11:43:26 +00:00
Erik Grinaker
e181b996c3 utils: move ShardStripeSize into shard module (#12640)
## Problem

`ShardStripeSize` will be used in the compute spec and internally in the
communicator. It shouldn't require pulling in all of `pageserver_api`.

## Summary of changes

Move `ShardStripeSize` into `utils::shard`, along with other basic shard
types. Also remove the `Default` implementation, to discourage clients
from falling back to a default (it's generally a footgun).

The type is still re-exported from `pageserver_api::shard`, along with
all the other shard types.
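After the move, the type is essentially a plain newtype, deliberately without `Default` (an illustrative sketch, not necessarily the exact definition):

```rust
/// Roughly the shape after the move: a plain newtype in `utils::shard`, with
/// no `Default` derived, so callers must get the stripe size from an
/// authoritative source rather than falling back to a guess.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ShardStripeSize(pub u32);
```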
2025-07-21 10:56:20 +00:00
Erik Grinaker
eb830fa547 pageserver/client_grpc: use unbounded pools (#12585)
## Problem

The communicator gRPC client currently uses bounded client/stream pools.
This can artificially constrain clients, especially after we remove
pipelining in #12584.

[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.

In the worst case, we'll degenerate to the current libpq state with one
stream per backend, but without the TCP connection overhead. In the
common case we expect significantly lower stream counts due to stream
sharing, driven e.g. by idle backends, LFC hits, read coalescing,
sharding (backends typically only talk to one shard at a time), etc.

Currently, Pageservers rarely serve more than 4000 backend connections,
so we have at least 2 orders of magnitude of headroom.

Touches #11735.
Requires #12584.

## Summary of changes

Remove the pool limits, and restructure the pools.

We still keep a separate bulk pool for GetPage batches of >4 pages (>32
KB), with fewer streams per connection. This reduces TCP-level
congestion and head-of-line blocking for non-bulk requests, and
concentrates larger window sizes on a smaller set of
streams/connections, presumably reducing memory usage. Apart from this,
bulk requests don't have any latency penalty compared to other requests.
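The bulk/non-bulk split boils down to a routing decision on batch size (illustrative; the names are hypothetical, the 4-page threshold is the one described above):

```rust
// Illustrative sketch of the pool choice; names are hypothetical.
const BULK_THRESHOLD_PAGES: usize = 4;

enum PoolKind {
    /// Latency-sensitive foreground requests.
    Regular,
    /// Large prefetch/vacuum-style batches: fewer streams per connection to
    /// limit TCP-level congestion and concentrate large window sizes.
    Bulk,
}

fn pool_for_batch(num_pages: usize) -> PoolKind {
    if num_pages > BULK_THRESHOLD_PAGES {
        PoolKind::Bulk
    } else {
        PoolKind::Regular
    }
}
```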
2025-07-14 13:22:38 +00:00
Erik Grinaker
42ab34dc36 pageserver/client_grpc: don't pipeline GetPage requests (#12584)
## Problem

The communicator gRPC client currently attempts to pipeline GetPage
requests from multiple callers onto the same gRPC stream. This has a
number of issues:

* Head-of-line blocking: the request may block on e.g. layer download or
LSN wait, delaying the next request.
* Cancellation: we can't easily cancel in-progress requests (e.g. due to
timeout or backend termination), so it may keep blocking the next
request (even its own retry).
* Complex stream scheduling: picking a stream becomes harder/slower, and
additional Tokio tasks and synchronization are needed for stream
management.

Touches #11735.
Requires #12579.

## Summary of changes

This patch removes pipelining of gRPC stream requests, and instead
prefers to scale out the number of streams to achieve the same
throughput. Stream scheduling has been rewritten, and mostly follows the
same pattern as the client pool with exclusive acquisition by a single
caller.

[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.

This has a number of advantages:

* It (mostly) eliminates head-of-line blocking (except at the TCP
level).
* Cancellation becomes trivial, by closing the stream.
* Stream scheduling becomes significantly simpler and cheaper.
* Individual callers can still use client-side batching for pipelining.
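With pipelining removed, the per-call flow is roughly the following (hypothetical simplified types; retries and error handling omitted):

```rust
// Minimal stubs so the sketch is self-contained; the real pool and stream
// types in the client_grpc crate are richer.
struct StreamPool;
struct StreamGuard;
struct GetPageRequest;
struct GetPageResponse;

impl StreamPool {
    async fn acquire(&self) -> anyhow::Result<StreamGuard> { Ok(StreamGuard) }
}
impl StreamGuard {
    async fn send(&mut self, _req: GetPageRequest) -> anyhow::Result<()> { Ok(()) }
    async fn recv(&mut self) -> anyhow::Result<GetPageResponse> { Ok(GetPageResponse) }
}

/// One caller owns one stream for the duration of a single request/response
/// exchange, so nothing queues behind someone else's slow read.
async fn get_page(pool: &StreamPool, req: GetPageRequest) -> anyhow::Result<GetPageResponse> {
    let mut stream = pool.acquire().await?; // exclusive acquisition
    stream.send(req).await?;
    let resp = stream.recv().await?;
    // On timeout or backend exit the stream can simply be closed, cancelling
    // the request; otherwise the guard goes back to the pool for reuse.
    Ok(resp)
}
```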
2025-07-14 12:11:33 +00:00
Erik Grinaker
f18cc808f0 pageserver/client_grpc: reap idle channels immediately (#12587)
## Problem

It can take 3x the idle timeout to reap a channel. We have to wait for
the idle timeout to trigger first for the stream, then the client, then
the channel.

Touches #11735.

## Summary of changes

Reap empty channels immediately, and rely indirectly on the
channel/stream timeouts.

This can still lead to 2x the idle timeout for streams (first stream
then client), but that's okay -- if the stream closes abruptly (e.g. due
to timeout or error) we want to keep the client around in the pool for a
while.
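The change amounts to checking channel occupancy when a client goes away, rather than waiting for another idle-timeout pass (a hypothetical sketch):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: track how many clients each channel currently backs,
/// and drop a channel as soon as that count reaches zero (e.g. because its
/// last client was reaped by the client idle timeout), instead of waiting for
/// a third idle-timeout pass on the channel itself.
struct ChannelPool {
    clients_per_channel: HashMap<u64, usize>,
}

impl ChannelPool {
    fn client_dropped(&mut self, channel_id: u64) {
        if let Some(count) = self.clients_per_channel.get_mut(&channel_id) {
            *count -= 1;
            if *count == 0 {
                // Reap the now-empty channel immediately.
                self.clients_per_channel.remove(&channel_id);
            }
        }
    }
}
```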
2025-07-14 10:47:26 +00:00
Erik Grinaker
d14d8271b8 pageserver/client_grpc: improve retry logic (#12579)
## Problem

gRPC client retries currently include pool acquisition under the
per-attempt timeout. If pool acquisition is slow (e.g. full pool), this
will cause spurious timeout warnings, and the caller will lose its place
in the pool queue.

Touches #11735.

## Summary of changes

Makes several improvements to retries and related logic:

* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, move constants into main module.
* Don't back off on the first retry, and reduce the initial/max backoffs to
5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart
to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
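The key structural point is that the per-attempt timeout lives inside the closure, after pool acquisition, roughly like this (illustrative; the real `with_retries` helper differs in detail):

```rust
use std::time::Duration;

/// Illustrative retry structure: `attempt` is expected to acquire its stream
/// from the pool first (outside the attempt timeout, so the caller keeps its
/// place in the queue) and only then wrap the request itself in a
/// per-attempt timeout.
async fn with_retries<T, E, Fut>(
    mut attempt: impl FnMut() -> Fut,
    is_retryable: impl Fn(&E) -> bool,
) -> Result<T, E>
where
    Fut: std::future::Future<Output = Result<T, E>>,
{
    // Backoffs per the description above: 5ms initial, 5s max, and none
    // before the first retry.
    let (initial, max) = (Duration::from_millis(5), Duration::from_secs(5));
    let mut backoff = Duration::ZERO;
    loop {
        match attempt().await {
            Ok(value) => return Ok(value),
            Err(err) if !is_retryable(&err) => return Err(err),
            Err(_) => {} // retryable: fall through and try again
        }
        tokio::time::sleep(backoff).await;
        backoff = if backoff.is_zero() { initial } else { (backoff * 2).min(max) };
    }
}
```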
2025-07-14 10:43:10 +00:00
Erik Grinaker
8aa9540a05 pageserver/page_api: include block number and rel in gRPC GetPageResponse (#12542)
## Problem

With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places in the stack
(concurrent reads, shard splits, LFC hits, etc.). Including the
block numbers with the pages in `GetPageResponse` gives better
verification and observability that the final responses are correct.

Touches #11735.
Requires #12480.

## Summary of changes

Add a `Page` struct with `block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
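The enriched response shape and the client-side check look roughly like this (a sketch, not the actual proto or verification code):

```rust
// Sketch of the enriched response items and the client-side check; the real
// messages are protobuf definitions in pageserver_page_api.
#[derive(Clone, Copy, PartialEq, Eq)]
struct RelTag { spcnode: u32, dbnode: u32, relnode: u32, forknum: u8 } // stand-in

struct Page {
    rel: RelTag,       // included for completeness
    block_number: u32, // enables verification of reassembled batches
    image: Vec<u8>,    // page image
}

struct GetPageResponse {
    request_id: u64,
    pages: Vec<Page>,
}

/// Pages must line up with the requested (rel, block) pairs, in order; this
/// catches bugs in fragmentation/reassembly across concurrent reads, shard
/// splits, and LFC hits.
fn verify(requested: &[(RelTag, u32)], resp: &GetPageResponse) -> bool {
    requested.len() == resp.pages.len()
        && requested
            .iter()
            .zip(&resp.pages)
            .all(|(&(rel, blkno), page)| page.rel == rel && page.block_number == blkno)
}
```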
2025-07-10 22:35:14 +00:00
Erik Grinaker
44ea17b7b2 pageserver/page_api: add attempt to GetPage request ID (#12536)
## Problem

`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.

Touches #11735.
Requires #12480.

## Summary of changes

Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
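Conceptually, the wire ID becomes a (request, attempt) pair (a hypothetical sketch of the idea, not the actual message layout):

```rust
/// Hypothetical sketch: the wire ID becomes a (request, attempt) pair, so a
/// retried request no longer collides with its earlier attempt and
/// server-side uniqueness assertions hold.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct RequestId {
    id: u64,      // unique per logical request
    attempt: u32, // incremented on each retry of the same request
}

impl RequestId {
    fn retry(self) -> Self {
        Self { attempt: self.attempt + 1, ..self }
    }
}
```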
2025-07-10 20:39:42 +00:00
Erik Grinaker
dcdfe80bf0 pagebench: add support for rich gRPC client (#12477)
## Problem

We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.

Touches #11735.
Requires #12476.

## Summary of changes

Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also add a compression parameter to
the client.
2025-07-10 17:30:09 +00:00
Erik Grinaker
2fc77c836b pageserver/client_grpc: add shard map updates (#12480)
## Problem

The communicator gRPC client must support changing the shard map on
splits.

Touches #11735.
Requires #12476.

## Summary of changes

* Wrap the shard set in an `ArcSwap` so it can be swapped out (sketched below).
* Add a new `ShardSpec` parameter struct to pass validated shard info to
the client.
* Add `update_shards()` to change the shard set. In-flight requests are
allowed to complete using the old shards.
* Restructure `get_page` to use a stable view of the shard map, and
retry errors at the top (pre-split) level to pick up shard map changes.
* Also mark `tonic::Status::Internal` as non-retryable, so that we can
use it for client-side invariant checks without continually retrying
these.
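The `ArcSwap` arrangement looks roughly like this (simplified; the map contents and method bodies are stand-ins):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arc_swap::ArcSwap;

// Simplified stand-ins for the real shard map contents.
struct ShardSpec {}                       // validated shard count, stripe size, URLs, ...
struct ShardClients(HashMap<u8, String>); // shard number -> pageserver endpoint

struct PageserverClient {
    /// Swappable shard set: readers take a cheap, stable snapshot; writers
    /// replace the whole set atomically on shard splits.
    shards: ArcSwap<ShardClients>,
}

impl PageserverClient {
    /// In-flight requests keep using the snapshot they already loaded; only
    /// new (or retried) requests observe the updated shard map.
    fn update_shards(&self, spec: ShardSpec) {
        self.shards.store(Arc::new(Self::build_shards(spec)));
    }

    /// A stable view for one (possibly multi-shard) get_page call; errors are
    /// retried at the top level, which picks up any shard map change.
    fn shard_snapshot(&self) -> Arc<ShardClients> {
        self.shards.load_full()
    }

    fn build_shards(_spec: ShardSpec) -> ShardClients {
        ShardClients(HashMap::new())
    }
}
```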
2025-07-10 15:46:39 +00:00
Erik Grinaker
f4b03ddd7b pageserver/client_grpc: reap idle pool resources (#12476)
## Problem

The gRPC client pools don't reap idle resources.

Touches #11735.
Requires #12475.

## Summary of changes

Reap idle pool resources (channels/clients/streams) after 3 minutes of
inactivity.

Also restructure the `StreamPool` to use a mutex rather than atomics for
synchronization, for simplicity. This will be optimized later.
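The reaping pass itself can be as simple as a sweep over last-use timestamps (an illustrative sketch, not the actual implementation):

```rust
use std::time::{Duration, Instant};

// Idle cutoff from the description above.
const IDLE_TIMEOUT: Duration = Duration::from_secs(3 * 60);

/// Illustrative sketch of one pool's reaping pass: drop any idle entry that
/// has not been used within IDLE_TIMEOUT; dropping the entry releases the
/// underlying channel/client/stream.
struct Idle<T> {
    resource: T,
    last_used: Instant,
}

fn reap_idle<T>(idle: &mut Vec<Idle<T>>, now: Instant) {
    idle.retain(|entry| now.duration_since(entry.last_used) < IDLE_TIMEOUT);
}
```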
2025-07-10 10:18:37 +00:00
Erik Grinaker
2f71eda00f pageserver/client_grpc: add separate pools for bulk requests (#12475)
## Problem

GetPage bulk requests such as prefetches and vacuum can head-of-line
block foreground requests, causing increased latency.

Touches #11735.
Requires #12469.

## Summary of changes

* Use dedicated channel/client/stream pools for bulk GetPage requests.
* Use lower concurrency but higher queue depth for bulk pools.
* Make pool limits configurable.
* Require an unbounded client pool for the stream pool, to avoid accidental
starvation.
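As a rough picture of the per-pool tuning described above (the field names and shapes here are hypothetical, not the actual options):

```rust
// Hypothetical shape of the configurable limits; actual option names differ.
struct PoolLimits {
    max_streams: usize, // concurrency: lower for the bulk pools
    queue_depth: usize, // per-stream queue depth: higher for the bulk pools
}

struct PoolConfig {
    regular: PoolLimits,
    bulk: PoolLimits,
}
```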
2025-07-09 16:12:59 +00:00
Erik Grinaker
8f3351fa91 pageserver/client_grpc: split GetPage batches across shards (#12469)
## Problem

The rich gRPC Pageserver client needs to split GetPage batches that
straddle multiple shards.

Touches #11735.
Requires #12462.

## Summary of changes

Adds a `GetPageSplitter`, which splits `GetPageRequest` batches that span
multiple shards and then reassembles the responses. Per-shard requests
are dispatched in parallel.
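The splitter's job is essentially: map each requested block to its shard, fan out one sub-request per shard, then put the pages back in the original request order. A simplified sketch, with a placeholder key-to-shard mapping:

```rust
use std::collections::HashMap;

/// Simplified sketch of the split/reassemble flow. The real GetPageSplitter
/// derives the block-to-shard mapping from the tenant's shard identity
/// (count, stripe size); `shard_for_block` below is only a placeholder.
fn split_by_shard(blocks: &[u32], shard_count: u8, stripe_size: u32) -> HashMap<u8, Vec<u32>> {
    let mut per_shard: HashMap<u8, Vec<u32>> = HashMap::new();
    for &blkno in blocks {
        // Each per-shard group becomes one sub-request, dispatched in parallel.
        let shard = shard_for_block(blkno, shard_count, stripe_size);
        per_shard.entry(shard).or_default().push(blkno);
    }
    per_shard
}

/// Placeholder striping: assign whole stripes to shards round-robin.
fn shard_for_block(blkno: u32, shard_count: u8, stripe_size: u32) -> u8 {
    ((blkno / stripe_size) % shard_count as u32) as u8
}

/// Reassembly: emit page images in the order of the original request, using
/// the block numbers to match responses back to requests.
fn reassemble(blocks: &[u32], pages_by_block: &HashMap<u32, Vec<u8>>) -> Option<Vec<Vec<u8>>> {
    blocks.iter().map(|b| pages_by_block.get(b).cloned()).collect()
}
```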
2025-07-09 14:17:22 +00:00
Erik Grinaker
3915995530 pageserver/client_grpc: add rich Pageserver gRPC client (#12462)
## Problem

For the communicator, we need a rich Pageserver gRPC client.

Touches #11735.
Requires #12434.

## Summary of changes

This patch adds an initial rich Pageserver gRPC client. It supports:

* Sharded tenants across multiple Pageservers.
* Pooling of connections, clients, and streams for efficient resource
use.
* Concurrent use by many callers.
* Internal handling of GetPage bidirectional streams, with pipelining
and error handling.
* Automatic retries.
* Observability.

The client is still under development. In particular, it needs GetPage
batch splitting, shard map updates, and performance optimization. This
will be addressed in follow-up PRs.
2025-07-09 11:42:46 +00:00
Erik Grinaker
8223c1ba9d pageserver/client_grpc: add initial gRPC client pools (#12434)
## Problem

The communicator will need gRPC channel/client/stream pools for
efficient reuse across many backends.

Touches #11735.
Requires #12396.

## Summary of changes

Adds three nested resource pools:

* `ChannelPool` for gRPC channels (i.e. TCP connections).
* `ClientPool` for gRPC clients (i.e. `page_api::Client`). Acquires
channels from `ChannelPool`.
* `StreamPool` for gRPC GetPage streams. Acquires clients from
`ClientPool`.

These are minimal functional implementations that will need further
improvements and performance optimization. However, the overall
structure is expected to be roughly final, so reviews should focus on
that.

The pools are not yet in use, but will form the foundation of a rich
gRPC Pageserver client used by the communicator (see #12462). This PR
also adds the initial crate scaffolding for that client.

See doc comments for details.
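The nesting can be pictured as three layers, where each pool acquires from the one below it when it has nothing idle (a simplified, hypothetical sketch; the doc comments describe the real structure):

```rust
// Simplified, hypothetical sketch of the nesting; the real pools also handle
// limits, idle reaping, and synchronization.
struct Channel;         // stands in for a tonic channel (one TCP connection)
struct Client(Channel); // stands in for page_api::Client over a channel
struct Stream(Client);  // stands in for a GetPage bidirectional stream

struct ChannelPool { idle: Vec<Channel> }
struct ClientPool  { channels: ChannelPool, idle: Vec<Client> }
struct StreamPool  { clients: ClientPool, idle: Vec<Stream> }

impl ChannelPool {
    fn acquire(&mut self) -> Channel {
        // Reuse an idle connection if possible, otherwise open a new one.
        self.idle.pop().unwrap_or(Channel)
    }
}

impl ClientPool {
    fn acquire(&mut self) -> Client {
        // A client is built on a channel acquired from the ChannelPool.
        self.idle.pop().unwrap_or_else(|| Client(self.channels.acquire()))
    }
}

impl StreamPool {
    fn acquire(&mut self) -> Stream {
        // A GetPage stream is opened on a client acquired from the ClientPool.
        self.idle.pop().unwrap_or_else(|| Stream(self.clients.acquire()))
    }
}
```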
2025-07-08 20:58:18 +00:00