## Problem
Neon's storage system currently has a hard-coded 512-byte block size for
Direct IO operations, which causes I/O errors on systems with disks that
have 4096-byte block sizes.
This results in errors like "vec read failed" and "Invalid argument (os
error 22)" on certain hardware configurations.
See issue #12623 for details.
## Summary of changes
Make Direct IO alignment configurable at build time to support both
512-byte and 4096-byte block sizes:
- Add `io-align-512` and `io-align-4k` cargo features (default: 512-byte
for backward compatibility)
- Make `DEFAULT_IO_BUFFER_ALIGNMENT` configurable via cargo features in
`pageserver_api`
- Update `DIO_CHUNK_SIZE` in vectored_dio_read to use the configured
alignment value dynamically
- Add `IO_ALIGNMENT` build argument to Dockerfile to allow building
images with different alignment settings
- Add startup logging to display the configured IO buffer alignment for
operational visibility
- Fix validation logic in `virtual_file.rs` to use the configured
alignment instead of hard-coded 512
This change allows Neon to run on systems with different disk block
sizes by building with the appropriate feature flag, addressing the
compatibility issues described in the RFC on Direct IO implementation.
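As a rough sketch (the actual cfg wiring may differ), the feature-gated
constant could look like this, with `io-align-512` remaining the default:
```rust
// Illustrative only: a cargo-feature-gated alignment constant in pageserver_api.
#[cfg(feature = "io-align-4k")]
pub const DEFAULT_IO_BUFFER_ALIGNMENT: usize = 4096;

#[cfg(not(feature = "io-align-4k"))]
pub const DEFAULT_IO_BUFFER_ALIGNMENT: usize = 512;
```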
## Performance Note
Benchmarks show 512-byte alignment performs significantly better than
4k:
- Write: 512-byte is 21-71% faster across percentiles (p99: 71% faster)
- Read: 512-byte is slightly faster (5-21% improvement)
This is why 512-byte remains the default.
However, some storage systems require 4k alignment and will fail with
EINVAL otherwise. This change adds build-time configuration to support
both environments.
## Problem
We currently access GCS through the AWS API instead of talking directly
to the GCS API.
## Summary of changes
Draft implementation of a GCS provider. We run Neon on GCS with the AWS
provider via [this
patch](https://github.com/neondatabase/neon/pull/10277), but want to use
GCS API directly. This implementation attempts to do so without adding a
GCS library dependency or new SDK, except for `gcp_auth`.
… walproposer (#895)
Data corruptions are typically detected on the pageserver side when it
replays WAL records. However, since PS doesn't synchronously replay WAL
records as they are being ingested through safekeepers, we need some
extra plumbing to feed information about pageserver-detected corruptions
during compaction (and/or WAL redo in general) back to SK and PG for
proper action.
We don't yet know what actions PG/SK should take upon receiving the
signal, but we should have the detection and feedback in place.
Add an extra `corruption_detected` field to the `PageserverFeedback`
message that is sent from PS -> SK -> PG. It's a boolean value that is
set to true when PS detects a "critical error" that signals data
corruption, and it's sent in all `PageserverFeedback` messages. Upon
receiving this signal, the safekeeper raises a
`safekeeper_ps_corruption_detected` gauge metric (value set to 1). The
safekeeper then forwards this signal to PG where a
`ps_corruption_detected` gauge metric (value also set to 1) is raised in
the `neon_perf_counters` view.
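A rough sketch of the extended message (the surrounding fields are
placeholders for the existing ones):
```rust
// Illustrative sketch; only `corruption_detected` is the new field.
pub struct PageserverFeedback {
    pub current_timeline_size: u64,
    pub last_received_lsn: u64,
    // ... other existing fields ...
    /// True once the pageserver has hit a critical error indicating data
    /// corruption (e.g. during compaction or WAL redo); sent in every
    /// feedback message thereafter.
    pub corruption_detected: bool,
}
```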
Added an integration test in
`test_compaction.py::test_ps_corruption_detection_feedback` that
confirms the safekeeper and PG can receive the data corruption signal in
the `PageserverFeedback` message under simulated data corruption.
## Problem
## Summary of changes
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>`
implementation. This avoids a bunch of boilerplate to convert
`GetPageSplitter` errors into `tonic::Status`.
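A minimal sketch of the idea (the variant names here are hypothetical):
```rust
use tonic::Status;

// Illustrative sketch: keep the SplitError -> Status mapping in one place.
#[derive(Debug)]
pub enum SplitError {
    UnknownShard(String),
    InvalidRequest(String),
}

impl From<SplitError> for Status {
    fn from(err: SplitError) -> Self {
        match err {
            SplitError::UnknownShard(msg) => Status::not_found(msg),
            SplitError::InvalidRequest(msg) => Status::invalid_argument(msg),
        }
    }
}
```
Handlers can then use `?` on `SplitError` results and get the conversion for free.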
Requires #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
## Problem
During shard splits, each parent shard is split and removed
incrementally. Only when all parent shards have split is the split
committed and the compute notified. This can take several minutes for
large tenants. In the meantime, the compute will be sending requests to
the (now-removed) parent shards.
This was (mostly) not a problem for the libpq protocol, because it does
shard routing on the server-side. The compute just sends requests to
some Pageserver, and the server will figure out which local shard should
serve it.
It is a problem for the gRPC protocol, where the client explicitly says
which shard it's talking to.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
Requires #12772.
## Summary of changes
* Add server-side routing of gRPC requests to any local child shards if
the parent does not exist.
* Add server-side splitting of GetPage batch requests straddling
multiple child shards.
* Move the `GetPageSplitter` into `pageserver_page_api`.
I really don't like this approach, but it avoids making changes to the
split protocol. I could be convinced we should change the split protocol
instead, e.g. to keep the parent shard alive until the split commits and
the compute has been notified, but we can also do that as a later change
without blocking the communicator on it.
## Problem
In https://github.com/neondatabase/neon/pull/12467, timeouts and retries
were added to `Cache::get` tenant shard resolution to paper over an
issue with read unavailability during shard splits. However, this
retries _all_ errors, including irrecoverable errors like `NotFound`.
This causes problems with gRPC child shard routing in #12702, which
targets specific shards with `ShardSelector::Known` and relies on prompt
`NotFound` errors to reroute requests to child shards. These retries
introduce a 1s delay for all reads during child routing.
The broader problem of read unavailability during shard splits is left
as future work, see https://databricks.atlassian.net/browse/LKB-672.
Touches #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
## Summary of changes
* Change `TenantManager` to always return a concrete
`GetActiveTimelineError`.
* Only retry `WaitForActiveTimeout` errors.
* Lots of code unindentation due to the simplified error handling.
Out of caution, we do not gate the retries on `ShardSelector`, since
this can trigger other races. Improvements here are left as future work.
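Roughly, the retry gate becomes a single match on the error variant (a
sketch with hypothetical variant names):
```rust
// Illustrative sketch: only the wait-for-active timeout is retried; anything
// else (e.g. NotFound during a shard split) is surfaced immediately so the
// caller can reroute to a child shard.
pub enum GetActiveTimelineError {
    WaitForActiveTimeout,
    NotFound,
    Cancelled,
}

fn is_retryable(err: &GetActiveTimelineError) -> bool {
    matches!(err, GetActiveTimelineError::WaitForActiveTimeout)
}
```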
## Problem
When tenants have a lot of timelines, the number of tenants that a
pageserver can comfortably handle goes down. Branching is much more
widely used in practice now than it was when this code was written, and
we generally run pageservers with a few thousand tenants (where each
tenant has many timelines), rather than the 10k-20k we might have done
historically.
This should really be something configurable, or a more direct proxy for
resource utilization (such as non-archived timeline count), but this
change should be a low effort improvement.
## Summary of changes
* Change the target shard count (MAX_SHARDS) to 2500 from 5000 when
calculating pageserver utilization (i.e. a 200% overcommit now
corresponds to 5000 shards, not 10000 shards)
Co-authored-by: John Spray <john.spray@databricks.com>
Before this PR, getpage requests wouldn't hold the
`applied_gc_cutoff_lsn` guard until they were done.
Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can
theoretically concurrently delete reconstruct data that we need to
reconstruct the page.
I don't think this practically occurs in production because the odds of
it happening are quite low, especially for primary read_write computes.
But RO replicas / standby_horizon rely on a correct
`applied_gc_cutoff_lsn`, so I'm fixing this as part of the work on
replacing the standby_horizon propagation mechanism with leases (LKB-88).
The change is gated behind a feature flag, and evaluated once when
entering `handle_pagestream` to avoid performance impact.
For observability, we add a field to the `handle_pagestream` span, and a
slow-log to the place in `gc_loop` where it waits for the in-flight
RcuReadGuards to drain.
refs
- fixes https://databricks.atlassian.net/browse/LKB-2572
- standby_horizon leases epic:
https://databricks.atlassian.net/browse/LKB-2572
---------
Co-authored-by: Christian Schwarz <Christian Schwarz>
## Problem
We saw the following failure mode in the field.
Context and observations:
* The storage controller keeps track of the latest generations and the
pageserver that issued the latest generation in the database
* When the storage controller needs to proxy a request (e.g. timeline
creation) to the pageservers, it will use the pageserver that issued the
latest generation from the db (generation_pageserver).
* pageserver-2.cell-2 got into a bad state and wasn't able to apply
location_config (e.g. detach a shard)
What happened:
1. pageserver-2.cell-2 was a secondary for our shard since we were not
able to detach it
2. control plane asked to detach a tenant (presumably because it was
idle)
a. In response storcon clears the generation_pageserver from the db and
attempts to detach all locations
b. it tries to detach pageserver-2.cell-2 first, but fails, which fails
the entire reconciliation leaving the good attached location still there
c. returns success to cplane
3. control plane asks to re-attach the tenant
a. In response storcon performs a reconciliation
b. it finds that the observed state matches the intent (remember we did
not detach the primary at step (2))
c. skips incrementing the generation and setting the
generation_pageserver column
Now any requests that need to be proxied to pageservers and rely on the
generation_pageserver db column fail because that's not set.
## Summary of changes
1. We do all non-essential location config calls (setting up
secondaries,
detaches) at the end of the reconciliation. Previously, we bailed out
of the reconciliation on the first failure. With this patch we attempt
all of the RPCs.
This allows the observed state to update even if another RPC failed for
unrelated reasons.
2. If the overall reconciliation failed, we don't want to remove nodes
from the
observed state as a safe-guard. With the previous patch, we'll get a
deletion delta to process, which would be ignored. Ignoring it is not
the right thing to do since it's out of sync with the db state.
Hence, on reconciliation failure, map deletions from the observed state
to the uncertain state. Future reconciliations will query the node to
refresh its observed state.
Closes LKB-204
Verify that a gRPC `GetPageRequest` has been sent to the shard that owns
the pages. This avoids spurious `NotFound` errors if a compute misroutes
a request, which can look scarier than it is (e.g. suggesting data loss).
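Sketch of the intent (the real check uses the pageserver's shard identity
types; `owns_block` is a stand-in):
```rust
use tonic::Status;

// Illustrative sketch: a misrouted request gets an explicit routing error
// rather than NotFound, so it doesn't look like missing data.
fn check_routing(owns_block: bool, shard: u32, blkno: u32) -> Result<(), Status> {
    if !owns_block {
        return Err(Status::invalid_argument(format!(
            "block {blkno} is not owned by shard {shard}; the compute's shard map is likely stale"
        )));
    }
    Ok(())
}
```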
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
## Problem
In our experience running the system so far, almost all of the "hang
compute" situations are due to the compute (postgres) pointing at the
wrong pageservers. We currently mainly rely on the Prometheus exporter
(PGExporter) running on PG to detect and report any down time, but these
can be unreliable because the read and write probes the PGExporter runs
do not always generate pageserver requests due to caching, even though
the real user might be experiencing down time when touching uncached
pages.
We are also about to start disk-wiping node pool rotation operations in
prod clusters for our pageservers, and it is critical to have a
convenient way to monitor the impact of these node pool rotations so
that we can quickly respond to any issues. These metrics should provide
very clear signals to address this operational need.
## Summary of changes
Added a pair of metrics to detect issues in postgres' PageStream
protocol communications (e.g. get_page_at_lsn, get_base_backup) with
pageservers:
* On the compute node (compute_ctl), exports a counter metric that is
incremented every time postgres requests a configuration refresh.
Postgres today only requests these configuration refreshes when it
cannot connect to a pageserver or if the pageserver rejects its request
by disconnecting.
* On the pageserver, exports a counter metric that is incremented every
time it receives a PageStream request that cannot be handled because the
tenant is not known or if the request was routed to the wrong shard
(e.g. secondary).
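A rough sketch of the two counters (metric names here are placeholders,
not necessarily the ones shipped):
```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// Illustrative sketch only.
static CONFIG_REFRESH_REQUESTS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "compute_pageserver_config_refresh_requests_total",
        "Times postgres asked compute_ctl for a pageserver configuration refresh"
    )
    .unwrap()
});

static MISROUTED_PAGESTREAM_REQUESTS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_misrouted_pagestream_requests_total",
        "PageStream requests for an unknown tenant or the wrong shard"
    )
    .unwrap()
});
```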
### How I plan to use metrics
I plan to use the metrics added here to create alerts. The alerts can
fire, for example, if these counters have been continuously increasing
for over a certain period of time. During rollouts, misrouted requests
may occasionally happen, but they should soon die down as
reconfigurations make progress. We can start with something like raising
the alert if the counters have been increasing continuously for over 5
minutes.
## How is this tested?
New integration tests in
`test_runner/regress/test_hadron_ps_connectivity_metrics.py`
Co-authored-by: William Huang <william.huang@databricks.com>
## Problem
LKB-197, #9516
The goal is to make sure the migration path is smooth.
The previous plan was to store new relations in the new keyspace and old
ones in the old keyspace until they got dropped. This makes the migration
path hard, as we can't validate v2 writes and can't roll back. This patch
gives us a smoother migration path:
- The first time we enable reldirv2 for a tenant, we copy over everything
in the old keyspace to the new one. This might create a short spike of
latency for the create relation operation, but it's a one-off.
- After that, we have identical v1/v2 keyspace and read/write both of
them. We validate reads every time we list the reldirs.
- If we are in `migrating` mode, use v1 as source of truth and log a
warning for failed v2 operations. If we are in `migrated` mode, use v2
as source of truth and error when writes fail.
- One compatibility test uses a dataset from the time when we enabled
reldirv2 (of the original rollout plan), which only has relations
written to the v2 keyspace instead of the v1 keyspace. We had to adjust
it accordingly.
- Add `migrated_at` in index_part to indicate the LSN at which we did the
initialization.
TODOs:
- Test if relv1 can be read below the migrated_at LSN.
- Move the initialization process to L0 compaction instead of doing it
on the write path.
- Disable relcache in the relv2 test case so that all code paths get
fully tested.
## Summary of changes
- New behavior of reldirv2 migration flags as described above.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
As part of the reldirv2 rollout: LKB-197.
We will use the number of dbs/rels as a criterion for whether to roll out
reldirv2 directly on the write path (the simplest and easiest way to roll
out). If the number of rels/dbs is small then it shouldn't take too long
on the write path.
## Summary of changes
* Compute db/rel count during basebackup.
* Also compute it during logical size computation.
* Collect the maximum number of dbs/rels across all timelines in the
feature flag properties.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We drive the get page requests that have started processing to
completion. So when the compute has received a reconfiguration request
and the old connection has a request processing on the pageserver, we
are going to issue the warning.
I spot checked a few instances of the warning and in all cases the
compute was already connected to the correct pageserver.
## Summary of changes
Downgrade to INFO. It would be nice to somehow figure out if the
connection has been terminated in the meantime, but the terminate libpq
message is still in the pipe while we're doing the shard resolution.
Closes LKB-2381
## Problem
Post LKB-198 rollout. We added a new strategy to generate image layers
at the L0-L1 boundary instead of the latest LSN to ensure too many L0
layers do not trigger image layer creation.
## Summary of changes
We already rolled it out to all users so we can remove the feature flag
now.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
NB: effectively a no-op in the neon env, since the handling is
config-gated in storcon.
## Problem
When a pageserver suffers from a local disk/node failure and restarts,
the storage controller will receive a re-attach call and return all the
tenants the pageserver is suppose to attach, but the pageserver will not
act on any tenants that it doesn't know about locally. As a result, the
pageserver will not rehydrate any tenants from remote storage if it
restarted following a local disk loss, while the storage controller
still thinks that the pageserver has all the tenants attached. This
leaves the system in a bad state, and the symptom is that PG's
pageserver connections will fail with "tenant not found" errors.
## Summary of changes
Made a slight change to the storage controller's `re_attach` API:
* The pageserver will set an additional bit `empty_local_disk` in the
reattach request, indicating whether it has started with an empty disk
or does not know about any tenants.
* Upon receiving the reattach request, if this `empty_local_disk` bit is
set, the storage controller will go ahead and clear all observed
locations referencing the pageserver. The reconciler will then discover
the discrepancy between the intended state and observed state of the
tenant and take care of the situation.
To facilitate rollouts this extra behavior in the `re_attach` API is
guarded by the `handle_ps_local_disk_loss` command line flag of the
storage controller.
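A rough sketch of the request extension (field and type names are
illustrative):
```rust
// Illustrative sketch of the re_attach request extension.
pub struct ReAttachRequest {
    pub node_id: u64,
    /// True if the pageserver started with no local tenant state (e.g. after a
    /// disk wipe), telling the storage controller to drop observed locations
    /// that reference this node.
    pub empty_local_disk: bool,
}
```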
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Include the IP address (optionally read from an env var) in the
pageserver's registration request.
Note that the IP address is ignored by the storage controller at the
moment, which makes it a no-op in the neon env.
## Problem
We've had bugs where the compute would use the stale default stripe size
from an unsharded tenant after the tenant split with a new stripe size.
## Summary of changes
Never specify a stripe size for unsharded tenants, to guard against
misuse. Only specify it once tenants are sharded and the stripe size
can't change.
Also opportunistically changes `GetPageSplitter` to return
`anyhow::Result`, since we'll be using this in other code paths as well
(specifically during server-side shard splits).
## Problem
Postgres will often immediately follow a relation existence check with a
relation size query. This incurs two roundtrips, and may prevent
effective caching.
See [Slack
thread](https://databricks.slack.com/archives/C091SDX74SC/p1751951732136139).
Touches #11728.
## Summary of changes
For the gRPC API:
* Add an `allow_missing` parameter to `GetRelSize`, which returns
`missing=true` instead of a `NotFound` error.
* Remove `CheckRelExists`.
There are no changes to libpq behavior.
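A rough sketch of the request/response shapes (the real types are
generated from the proto definitions):
```rust
// Illustrative sketch only.
pub struct RelTag {
    pub spcnode: u32,
    pub dbnode: u32,
    pub relnode: u32,
    pub forknum: u8,
}

pub struct GetRelSizeRequest {
    pub rel: RelTag,
    pub read_lsn: u64,
    /// If true, a missing relation yields `missing: true` instead of a
    /// NotFound error, so an existence check plus size query is one roundtrip.
    pub allow_missing: bool,
}

pub struct GetRelSizeResponse {
    pub num_blocks: u32,
    pub missing: bool,
}
```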
## Problem
`ShardStripeSize` will be used in the compute spec and internally in the
communicator. It shouldn't require pulling in all of `pageserver_api`.
## Summary of changes
Move `ShardStripeSize` into `utils::shard`, along with other basic shard
types. Also remove the `Default` implementation, to discourage clients
from falling back to a default (it's generally a footgun).
The type is still re-exported from `pageserver_api::shard`, along with
all the other shard types.
## Problem
The gRPC page service does not properly react to shutdown cancellation.
In particular, Tonic considers an open GetPage stream to be an in-flight
request, so it will wait for it to complete before shutting down.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
## Summary of changes
Properly react to the server's cancellation token and take out gate
guards in gRPC request handlers.
Also document cancellation handling. In particular, that Tonic will drop
futures when clients go away (e.g. on timeout or shutdown), so the read
path must be cancellation-safe. It is believed to be (modulo possible
logging noise), but this will be verified later.
## Problem
Part of https://github.com/neondatabase/neon/issues/11318; it is not
entirely safe to run gc-compaction over the metadata key range due to
tombstones and the implications of image layers (a key missing from an
image layer == the key does not exist). The auto gc-compaction trigger
already skips metadata
key ranges (see `schedule_auto_compaction` call in
`trigger_auto_compaction`). In this patch we enforce it directly in
gc_compact_inner so that compactions triggered via HTTP API will also be
subject to this restriction.
## Summary of changes
Ensure gc-compaction only runs on rel key ranges.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The warning message was seen during deployment, but it's actually OK.
## Summary of changes
- Treat `"No broker updates received for a while ..."` as an info
message.
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
We don't want to depend on postgres_ffi in an API crate. If there is no
such dependency, we can compile stuff like `storcon_cli` without needing
a full working postgres build. Fixes a regression from #12548 (before
that, we could compile it without one).
- Remove a few obsolete "allowed error messages" from tests. The
pageserver doesn't emit those messages anymore.
- Remove misplaced and outdated docstring comment from
`test_tenants.py`. A docstring is supposed to be the first thing in a
function, but we had added some code before it. And it was outdated, as
we haven't supported running without safekeepers for a long time.
- Fix misc typos in comments
- Remove obsolete comment about backwards compatibility with safekeepers
without `TIMELINE_STATUS` API. All safekeepers have it by now.
Include records and image in the debug get page handler.
This endpoint does not update the metrics and does not support tracing.
Note that this now returns individual bytes which need to be encoded
properly for debugging.
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
# TLDR
This PR is a no-op.
## Problem
When a SK loses a disk, it must recover all WALs from the very
beginning. This may take days/weeks to catch up to the latest WALs for
all timelines it owns.
## Summary of changes
When the SK starts up, if it finds that it has 0 timelines:
- it will ask the SC for the timelines it owns.
- Then, it pulls the timelines from its peer safekeepers to restore WAL
redundancy right away.
After pulling the timelines is complete, it will become active and
accept new WALs.
The current impl is a prototype. We can optimize it further, e.g., by
pulling timelines in parallel.
---------
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
## Problem
The communicator gRPC client currently uses bounded client/stream pools.
This can artificially constrain clients, especially after we remove
pipelining in #12584.
[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.
In the worst case, we'll degenerate to the current libpq state with one
stream per backend, but without the TCP connection overhead. In the
common case we expect significantly lower stream counts due to stream
sharing, driven e.g. by idle backends, LFC hits, read coalescing,
sharding (backends typically only talk to one shard at a time), etc.
Currently, Pageservers rarely serve more than 4000 backend connections,
so we have at least 2 orders of magnitude of headroom.
Touches #11735.
Requires #12584.
## Summary of changes
Remove the pool limits, and restructure the pools.
We still keep a separate bulk pool for GetPage batches of >4 pages (>32
KB), with fewer streams per connection. This reduces TCP-level
congestion and head-of-line blocking for non-bulk requests, and
concentrates larger window sizes on a smaller set of
streams/connections, presumably reducing memory usage. Apart from this,
bulk requests don't have any latency penalty compared to other requests.
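A minimal sketch of the pool split (types and the threshold constant name
are placeholders):
```rust
// Illustrative sketch: batches above the threshold go to a dedicated bulk
// pool with fewer streams per connection; everything else uses the regular pool.
const BULK_THRESHOLD_PAGES: usize = 4; // >4 pages (>32 KB) counts as bulk

struct StreamPools<P> {
    regular: P,
    bulk: P,
}

impl<P> StreamPools<P> {
    fn pool_for(&self, batch_pages: usize) -> &P {
        if batch_pages > BULK_THRESHOLD_PAGES {
            &self.bulk
        } else {
            &self.regular
        }
    }
}
```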
## Problem
We don't log the timeline ID when rolling ephemeral layers during
housekeeping.
Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179)
## Summary of changes
Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer`
from the housekeeping loop.
We don't instrument the function itself: a future caller may not already
have a span that includes the tenant_id (so the function's span would
have to carry it), but we don't want to duplicate the tenant_id for
callers that already do.
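Roughly, the call site looks like this (a sketch with a hypothetical
helper; the surrounding housekeeping code is elided):
```rust
use tracing::Instrument;

// Illustrative sketch: the span carrying the timeline ID is attached at the
// call site instead of on the function itself.
async fn with_timeline_span<F>(timeline_id: &str, fut: F)
where
    F: std::future::Future<Output = ()>,
{
    fut.instrument(tracing::info_span!(
        "maybe_freeze_ephemeral_layer",
        timeline_id = %timeline_id
    ))
    .await;
}
```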
## Problem
The communicator gRPC client currently attempts to pipeline GetPage
requests from multiple callers onto the same gRPC stream. This has a
number of issues:
* Head-of-line blocking: the request may block on e.g. layer download or
LSN wait, delaying the next request.
* Cancellation: we can't easily cancel in-progress requests (e.g. due to
timeout or backend termination), so it may keep blocking the next
request (even its own retry).
* Complex stream scheduling: picking a stream becomes harder/slower, and
additional Tokio tasks and synchronization are needed for stream
management.
Touches #11735.
Requires #12579.
## Summary of changes
This patch removes pipelining of gRPC stream requests, and instead
prefers to scale out the number of streams to achieve the same
throughput. Stream scheduling has been rewritten, and mostly follows the
same pattern as the client pool with exclusive acquisition by a single
caller.
[Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that
the cost of an idle server-side GetPage worker task is about 26 KB (2.5
GB for 100,000), so we can afford to scale out.
This has a number of advantages:
* It (mostly) eliminates head-of-line blocking (except at the TCP
level).
* Cancellation becomes trivial, by closing the stream.
* Stream scheduling becomes significantly simpler and cheaper.
* Individual callers can still use client-side batching for pipelining.
## Problem
The new communicator gRPC client has significantly worse Pagebench
performance than a basic gRPC client. We need to find out why.
## Summary of changes
Add a `pagebench --profile` flag which takes a client CPU profile of the
benchmark and writes a flamegraph to `profile.svg`.
## Problem
It can take 3x the idle timeout to reap a channel. We have to wait for
the idle timeout to trigger first for the stream, then the client, then
the channel.
Touches #11735.
## Summary of changes
Reap empty channels immediately, and rely indirectly on the
channel/stream timeouts.
This can still lead to 2x the idle timeout for streams (first stream
then client), but that's okay -- if the stream closes abruptly (e.g. due
to timeout or error) we want to keep the client around in the pool for a
while.
## Problem
gRPC client retries currently include pool acquisition under the
per-attempt timeout. If pool acquisition is slow (e.g. full pool), this
will cause spurious timeout warnings, and the caller will lose its place
in the pool queue.
Touches #11735.
## Summary of changes
Makes several improvements to retries and related logic:
* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, move constants into main module.
* Don't back off on the first retry, and reduce initial/max backoffs to
5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart
to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
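For example, the backoff schedule sketched below has no delay before the
first retry and grows exponentially from 5ms up to the 5s cap (constant
names are illustrative):
```rust
use std::time::Duration;

// Illustrative sketch of the backoff schedule.
const INITIAL_BACKOFF: Duration = Duration::from_millis(5);
const MAX_BACKOFF: Duration = Duration::from_secs(5);

fn backoff_for_attempt(attempt: u32) -> Duration {
    if attempt == 0 {
        return Duration::ZERO; // no backoff before the first retry
    }
    (INITIAL_BACKOFF * 2u32.saturating_pow(attempt - 1)).min(MAX_BACKOFF)
}
```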
## Problem
For the communicator scheduling policy, we need to understand the
server-side cost of idle gRPC streams.
Touches #11735.
## Summary of changes
Add an `idle-streams` benchmark to `pagebench` which opens a large
number of idle gRPC GetPage streams.
# TLDR
Problem-I is a bug fix. The rest are no-ops.
## Problem I
The pageserver checks whether to create image layers based on elapsed
time, but this check depends on the current logical size, which is only
computed on shard 0. Thus, for non-0 shards, the check will be
ineffective and image creation will never be done for idle tenants.
## Summary of changes I
This PR fixes the problem by simply removing the dependency on current
logical size.
## Summary of changes II
This PR adds a timeout when calling the page server to split a shard, to
make sure the SC does not wait for the API call forever. Currently the
PR doesn't add any retry logic because it's not clear whether page server shard
split can be safely retried if the existing operation is still ongoing
or left the storage in a bad state. Thus it's better to abort the whole
operation and restart.
## Problem III
`test_remote_failures` requires the PS to be compiled in testing mode.
PSes in dev/staging are compiled without this mode.
## Summary of changes III
Remove the restriction and also increase the number of total failures
allowed.
## Summary of changes IV
Remove the test on the PS getpage HTTP route.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
## Problem
close LKB-162
close https://github.com/neondatabase/cloud/issues/30665, related to
https://github.com/neondatabase/cloud/issues/29434
We see a lot of errors like:
```
2025-05-22T23:06:14.928959Z ERROR compaction_loop{tenant_id=? shard_id=0304}:run:gc_compact_timeline{timeline_id=?}: error applying 4 WAL records 35/DC0DF0B8..3B/E43188C0 (8119 bytes) to key 000000067F0000400500006027000000B9D0, from base image with LSN 0/0 to reconstruct page image at LSN 61/150B9B20 n_attempts=0: apply_wal_records
Caused by:
0: read walredo stdout
1: early eof
```
which is an acceptable form of error and we should downgrade it to
warning.
## Summary of changes
walredo errors during gc-compaction are expected when the data below the
gc horizon does not contain a full key history. This is possible in some
rare cases where gc is only able to remove data in the middle of the
history, but not all of the earlier history, when a full keyspace gets
deleted.
Signed-off-by: Alex Chi Z <chi@neon.tech>
This is a no-op for the neon deployment
* Introduce the concept of the image consistent LSN: the largest LSN
below which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe
HTTP endpoint
* Add a sharded timeline describe endpoint to storcon
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
## Problem
With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places of the stack
(concurrent reads, shard splits, LFC hits, etc). If we included the
block numbers with the pages in `GetPageResponse` we could have better
verification and observability that the final responses are correct.
Touches #11735.
Requires #12480.
## Summary of changes
Add a `Page` struct with `block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
## Problem
`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.
Touches #11735.
Requires #12480.
## Summary of changes
Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
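A minimal sketch of the extended ID (the actual representation may
differ):
```rust
// Illustrative sketch: the wire-level ID is unique per attempt.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct RequestId {
    pub id: u64,
    pub attempt: u32,
}

impl RequestId {
    pub fn next_attempt(self) -> Self {
        Self {
            attempt: self.attempt + 1,
            ..self
        }
    }
}
```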
## Problem
We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.
Touches #11735.
Requires #12476.
## Summary of changes
Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also adds a compression parameter to
the client.
## Problem
The communicator gRPC client must support changing the shard map on
splits.
Touches #11735.
Requires #12476.
## Summary of changes
* Wrap the shard set in an `ArcSwap` to allow swapping it out.
* Add a new `ShardSpec` parameter struct to pass validated shard info to
the client.
* Add `update_shards()` to change the shard set. In-flight requests are
allowed to complete using the old shards.
* Restructure `get_page` to use a stable view of the shard map, and
retry errors at the top (pre-split) level to pick up shard map changes.
* Also marks `tonic::Status::Internal` as non-retryable, so that we can
use it for client-side invariant checks without continually retrying
these.
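A rough sketch of the swap mechanics (the real `ShardSpec` carries much
more than endpoints):
```rust
use std::sync::Arc;
use arc_swap::ArcSwap;

// Illustrative sketch: in-flight requests keep using the snapshot they loaded
// at the start of the call, while update_shards() swaps in the new map.
pub struct ShardMap {
    pub endpoints: Vec<String>, // shard index -> endpoint, as a stand-in
}

pub struct PageserverClient {
    shards: ArcSwap<ShardMap>,
}

impl PageserverClient {
    pub fn update_shards(&self, new: ShardMap) {
        self.shards.store(Arc::new(new));
    }

    pub fn current_shards(&self) -> Arc<ShardMap> {
        self.shards.load_full()
    }
}
```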
# TLDR
All changes are no-ops except for some metrics.
## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors
Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC
Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)
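A rough sketch of the metric shape, using an outcome label rather than
separate counters (label values are illustrative):
```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Illustrative sketch only.
static PAGESTREAM_HANDLER_RESULTS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_pagestream_handler_results_total",
        "Pagestream request results by outcome",
        &["outcome"] // "success", "internal_error", "other_error"
    )
    .unwrap()
});

fn record_outcome(outcome: &str) {
    PAGESTREAM_HANDLER_RESULTS.with_label_values(&[outcome]).inc();
}
```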
## Summary of changes II
This PR adds a pageserver endpoint and its counterpart in the storage
controller to list the visible size of all tenant shards. This will be a
prerequisite of the tenant rebalance command.
## Problem III
We need a way to download WAL
segments/layerfiles from S3 and replay WAL records. We cannot access
production S3 from our laptops directly, and we also can't transfer any
user data out of production systems for GDPR compliance, so we need
solutions.
## Summary of changes III
This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.
## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful for debugging compaction issues, as
index_part often contains a lot of covered layers due to PITR.
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>