The LSN lease api actually accepts a tenant_shard_id, not a tenant_id.
But we put the Display of the tenant_shard_id into the tenant_id field.
This PR fixes it.
Refs
- fixes https://databricks.atlassian.net/browse/LKB-2930
## Problem
Post LKB-198 rollout. We added a new strategy to generate image layers
at the L0-L1 boundary instead of the latest LSN to ensure too many L0
layers do not trigger image layer creation.
## Summary of changes
We already rolled it out to all users so we can remove the feature flag
now.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
NB: effectively a no-op in the neon env since the handling is config
gated
in storcon
## Problem
When a pageserver suffers from a local disk/node failure and restarts,
the storage controller will receive a re-attach call and return all the
tenants the pageserver is suppose to attach, but the pageserver will not
act on any tenants that it doesn't know about locally. As a result, the
pageserver will not rehydrate any tenants from remote storage if it
restarted following a local disk loss, while the storage controller
still thinks that the pageserver have all the tenants attached. This
leaves the system in a bad state, and the symptom is that PG's
pageserver connections will fail with "tenant not found" errors.
## Summary of changes
Made a slight change to the storage controller's `re_attach` API:
* The pageserver will set an additional bit `empty_local_disk` in the
reattach request, indicating whether it has started with an empty disk
or does not know about any tenants.
* Upon receiving the reattach request, if this `empty_local_disk` bit is
set, the storage controller will go ahead and clear all observed
locations referencing the pageserver. The reconciler will then discover
the discrepancy between the intended state and observed state of the
tenant and take care of the situation.
To facilitate rollouts this extra behavior in the `re_attach` API is
guarded by the `handle_ps_local_disk_loss` command line flag of the
storage controller.
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Include the ip address (optionally read from an env var) in the
pageserver's registration request.
Note that the ip address is ignored by the storage controller at the
moment, which makes it a no-op
in the neon env.
## Problem
We've had bugs where the compute would use the stale default stripe size
from an unsharded tenant after the tenant split with a new stripe size.
## Summary of changes
Never specify a stripe size for unsharded tenants, to guard against
misuse. Only specify it once tenants are sharded and the stripe size
can't change.
Also opportunistically changes `GetPageSplitter` to return
`anyhow::Result`, since we'll be using this in other code paths as well
(specifically during server-side shard splits).
## Problem
Postgres will often immediately follow a relation existence check with a
relation size query. This incurs two roundtrips, and may prevent
effective caching.
See [Slack
thread](https://databricks.slack.com/archives/C091SDX74SC/p1751951732136139).
Touches #11728.
## Summary of changes
For the gRPC API:
* Add an `allow_missing` parameter to `GetRelSize`, which returns
`missing=true` instead of a `NotFound` error.
* Remove `CheckRelExists`.
There are no changes to libpq behavior.
## Problem
`ShardStripeSize` will be used in the compute spec and internally in the
communicator. It shouldn't require pulling in all of `pageserver_api`.
## Summary of changes
Move `ShardStripeSize` into `utils::shard`, along with other basic shard
types. Also remove the `Default` implementation, to discourage clients
from falling back to a default (it's generally a footgun).
The type is still re-exported from `pageserver_api::shard`, along with
all the other shard types.
## Problem
The gRPC page service does not properly react to shutdown cancellation.
In particular, Tonic considers an open GetPage stream to be an in-flight
request, so it will wait for it to complete before shutting down.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
## Summary of changes
Properly react to the server's cancellation token and take out gate
guards in gRPC request handlers.
Also document cancellation handling. In particular, that Tonic will drop
futures when clients go away (e.g. on timeout or shutdown), so the read
path must be cancellation-safe. It is believed to be (modulo possible
logging noise), but this will be verified later.
## Problem
part of https://github.com/neondatabase/neon/issues/11318 ; it is not
entirely safe to run gc-compaction over the metadata key range due to
tombstones and implications of image layers (missing key in image layer
== key not exist). The auto gc-compaction trigger already skips metadata
key ranges (see `schedule_auto_compaction` call in
`trigger_auto_compaction`). In this patch we enforce it directly in
gc_compact_inner so that compactions triggered via HTTP API will also be
subject to this restriction.
## Summary of changes
Ensure gc-compaction only runs on rel key ranges.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The warning message was seen during deployment, but it's actually OK.
## Summary of changes
- Treat `"No broker updates received for a while ..."` as an info
message.
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
We don't want to depend on postgres_ffi in an API crate. If there is no
such dependency, we can compile stuff like `storcon_cli` without needing
a full working postgres build. Fixes regression of #12548 (before we
could compile it).
- Remove a few obsolete "allowed error messages" from tests. The
pageserver doesn't emit those messages anymore.
- Remove misplaced and outdated docstring comment from
`test_tenants.py`. A docstring is supposed to be the first thing in a
function, but we had added some code before it. And it was outdated, as
we haven't supported running without safekeepers for a long time.
- Fix misc typos in comments
- Remove obsolete comment about backwards compatibility with safekeepers
without `TIMELINE_STATUS` API. All safekeepers have it by now.
Include records and image in the debug get page handler.
This endpoint does not update the metrics and does not support tracing.
Note that this now returns individual bytes which need to be encoded
properly for debugging.
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
# TLDR
This PR is a no-op.
## Problem
When a SK loses a disk, it must recover all WALs from the very
beginning. This may take days/weeks to catch up to the latest WALs for
all timelines it owns.
## Summary of changes
When SK starts up,
if it finds that it has 0 timelines,
- it will ask SC for the timeline it owns.
- Then, pulls the timeline from its peer safekeepers to restore the WAL
redundancy right away.
After pulling timeline is complete, it will become active and accepts
new WALs.
The current impl is a prototype. We can optimize the impl further, e.g.,
parallel pull timelines.
---------
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
## Problem
The communicator gRPC client currently uses bounded client/stream pools.
This can artificially constrain clients, especially after we remove
pipelining in #12584.
[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.
In the worst case, we'll degenerate to the current libpq state with one
stream per backend, but without the TCP connection overhead. In the
common case we expect significantly lower stream counts due to stream
sharing, driven e.g. by idle backends, LFC hits, read coalescing,
sharding (backends typically only talk to one shard at a time), etc.
Currently, Pageservers rarely serve more than 4000 backend connections,
so we have at least 2 orders of magnitude of headroom.
Touches #11735.
Requires #12584.
## Summary of changes
Remove the pool limits, and restructure the pools.
We still keep a separate bulk pool for Getpage batches of >4 pages (>32
KB), with fewer streams per connection. This reduces TCP-level
congestion and head-of-line blocking for non-bulk requests, and
concentrates larger window sizes on a smaller set of
streams/connections, presumably reducing memory usage. Apart from this,
bulk requests don't have any latency penalty compared to other requests.
## Problem
We don't log the timeline ID when rolling ephemeral layers during
housekeeping.
Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179)
## Summary of changes
Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer`
from the housekeeping loop.
We don't instrument the function itself, since future callers may not
have a span including the tenant_id already, but we don't want to
duplicate the tenant_id for these spans.