Commit Graph

1247 Commits

Author SHA1 Message Date
Heikki Linnakangas
fc35be0397 Remove the half-baked Adaptive Radix Tree implementation
We are committed to using the resizeable hash table for now. ART is a
great data structure, but it's too much for now. Maybe later.
2025-07-23 01:49:56 +03:00
Heikki Linnakangas
48535798ba Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-23 00:00:10 +03:00
HaoyuHuang
8de320ab9b Add a few compute_tool changes (#12677)
## Summary of changes
All changes are no-op.
2025-07-22 16:22:18 +00:00
Folke Behrens
108f7ec544 Bump opentelemetry crates to 0.30 (#12680)
This rebuilds #11552 on top the current Cargo.lock.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2025-07-22 16:05:35 +00:00
Vlad Lazar
d91d018afa storcon: handle pageserver disk loss (#12667)
NB: effectively a no-op in the neon env since the handling is config
gated
in storcon

## Problem

When a pageserver suffers from a local disk/node failure and restarts,
the storage controller will receive a re-attach call and return all the
tenants the pageserver is suppose to attach, but the pageserver will not
act on any tenants that it doesn't know about locally. As a result, the
pageserver will not rehydrate any tenants from remote storage if it
restarted following a local disk loss, while the storage controller
still thinks that the pageserver have all the tenants attached. This
leaves the system in a bad state, and the symptom is that PG's
pageserver connections will fail with "tenant not found" errors.

## Summary of changes

Made a slight change to the storage controller's `re_attach` API:
* The pageserver will set an additional bit `empty_local_disk` in the
reattach request, indicating whether it has started with an empty disk
or does not know about any tenants.
* Upon receiving the reattach request, if this `empty_local_disk` bit is
set, the storage controller will go ahead and clear all observed
locations referencing the pageserver. The reconciler will then discover
the discrepancy between the intended state and observed state of the
tenant and take care of the situation.

To facilitate rollouts this extra behavior in the `re_attach` API is
guarded by the `handle_ps_local_disk_loss` command line flag of the
storage controller.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-22 11:04:03 +00:00
Folke Behrens
9c0efba91e Bump rand crate to 0.9 (#12674) 2025-07-22 09:31:39 +00:00
Arpad Müller
80baeaa084 storcon: add force_upsert flag to timeline_import endpoint (#12622)
It is useful to have ability to update an existing timeline entry, as a
way to mirror legacy migrations to the storcon managed table.
2025-07-21 21:14:15 +00:00
Conrad Ludgate
b2ecb10f91 [proxy] rework handling of notices in sql-over-http (#12659)
A replacement for #10254 which allows us to introduce notice messages
for sql-over-http in the future if we want to. This also removes the
`ParameterStatus` and `Notification` handling as there's nothing we
could/should do for those.
2025-07-21 12:50:13 +00:00
Erik Grinaker
e181b996c3 utils: move ShardStripeSize into shard module (#12640)
## Problem

`ShardStripeSize` will be used in the compute spec and internally in the
communicator. It shouldn't require pulling in all of `pageserver_api`.

## Summary of changes

Move `ShardStripeSize` into `utils::shard`, along with other basic shard
types. Also remove the `Default` implementation, to discourage clients
from falling back to a default (it's generally a footgun).

The type is still re-exported from `pageserver_api::shard`, along with
all the other shard types.
2025-07-21 10:56:20 +00:00
Heikki Linnakangas
e2c3c2eccb Merge remote-tracking branch 'origin/main' into HEAD 2025-07-20 00:58:57 +03:00
Victor Polevoy
cb50291dcd Fetches the SLRU segment via the new communicator.
The fetch is done not into a buffer as earlier, but directly into the
file.
2025-07-18 10:02:31 +02:00
Alexey Kondratov
53a05e8ccb fix(compute_ctl): Only offload LFC state if no prewarming is in progress (#12645)
## Problem

We currently offload LFC state unconditionally, which can cause
problems. Imagine a situation:
1. Endpoint started with `autoprewarm: true`.
2. While prewarming is not completed, we upload the new incomplete
state.
3. Compute gets interrupted and restarts.
4. We start again and try to prewarm with the state from 2. instead of
the previous complete state.

During the orchestrated prewarming, it's probably not a big issue, but
it's still better to do not interfere with the prewarm process.

## Summary of changes

Do not offload LFC state if we are currently prewarming or any issue
occurred. While on it, also introduce `Skipped` LFC prewarm status,
which is used when the corresponding LFC state is not present in the
endpoint storage. It's primarily needed to distinguish the first compute
start for particular endpoint, as it's completely valid to do not have
LFC state yet.
2025-07-17 21:43:43 +00:00
Vlad Lazar
62c0152e6b pageserver: shut down compute connections at libpq level (#12642)
## Problem

Previously, if a get page failure was cause by timeline shutdown, the
pageserver would attempt to tear down the connection gracefully:
`shutdown(SHUT_WR)` followed by `close()`.

This triggers a code path on the compute where it has to tell apart
between an idle connection and a closed one. That code is bug prone, so
we can just side-step the issue by shutting down the connection via a
libpq error message.

This surfaced as instability in test_shard_resolve_during_split_abort.
It's a new test, but the issue existed for ages.

## Summary of Changes

Send a libpq error message instead of doing graceful TCP connection
shutdown.

Closes LKB-648
2025-07-17 21:03:55 +00:00
Conrad Ludgate
8b0f2efa57 experiment with an InfoMetrics metric family (#12612)
Putting this in the neon codebase for now, to experiment. Can be lifted
into measured at a later date.

This metric family is like a MetricVec, but it only supports 1 label
being set at a time. It is useful for reporting info, rather than
reporting metrics.
https://www.robustperception.io/exposing-the-software-version-to-prometheus/
2025-07-17 17:58:47 +00:00
quantumish
b309cbc6e9 Add resizable hashmap and RwLock implementations to neon-shmem (#12596)
Second PR for the hashmap behind the updated LFC implementation ([see
first here](https://github.com/neondatabase/neon/pull/12595)). This only
adds the raw code for the hashmap/lock implementations and doesn't plug
it into the crate (that's dependent on the previous PR and should
probably be done when the full integration into the new communicator is
merged alongside `communicator-rewrite` changes?).

Some high level details: the communicator codebase expects to be able to
store references to entries within this hashmap for arbitrary periods of
time and so the hashmap cannot be allowed to move them during a rehash.
As a result, this implementation has a slightly unusual structure where
key-value pairs (and hash chains) are allocated in a separate region
with a freelist. The core hashmap structure is then an array of
"dictionary entries" that are just indexes into this region of key-value
pairs.

Concurrency support is very naive at the moment with the entire map
guarded by one big `RwLock` (which is implemented on top of a
`pthread_rwlock_t` since Rust doesn't guarantee that a
`std::sync::RwLock` is safe to use in shared memory). This (along with a
lot of other things) is being changed on the
`quantumish/lfc-resizable-map` branch.
2025-07-17 17:40:53 +00:00
HaoyuHuang
b7fc5a2fe0 A few SC changes (#12615)
## Summary of changes
A bunch of no-op changes.

---------

Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-17 13:14:36 +00:00
Erik Grinaker
edcdd6ca9c Merge branch 'main' into communicator-rewrite 2025-07-17 10:59:37 +02:00
Tristan Partin
9e154a8130 PG: smooth max wal rate (#12514)
## Problem
We were only resetting the limit in the wal proposer. If backends are
back pressured, it might take a while for the wal proposer to receive a
new WAL to reset the limit.

## Summary of changes
Backend also checks the time and resets the limit.

## How is this tested?
pgbench has more smooth tps

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-16 16:11:25 +00:00
quantumish
809633903d Move ShmemHandle into separate module, tweak documentation (#12595)
Initial PR for the hashmap behind the updated LFC implementation. This
refactors `neon-shmem` so that the actual shared memory utilities are in
a separate module within the crate. Beyond that, it slightly changes
some of the docstrings so that they play nicer with `cargo doc`.
2025-07-15 17:40:40 +00:00
Arpad Müller
5c934efb29 Don't depend on the postgres_ffi just for one type (#12610)
We don't want to depend on postgres_ffi in an API crate. If there is no
such dependency, we can compile stuff like `storcon_cli` without needing
a full working postgres build. Fixes regression of #12548 (before we
could compile it).
2025-07-15 17:28:08 +00:00
Erik Grinaker
367d96e25b Merge branch 'main' into communicator-rewrite 2025-07-14 18:47:23 +02:00
HaoyuHuang
f67a8a173e A few SK changes (#12577)
# TLDR 
This PR is a no-op. 

## Problem
When a SK loses a disk, it must recover all WALs from the very
beginning. This may take days/weeks to catch up to the latest WALs for
all timelines it owns.

## Summary of changes
When SK starts up,
if it finds that it has 0 timelines,
- it will ask SC for the timeline it owns.
- Then, pulls the timeline from its peer safekeepers to restore the WAL
redundancy right away.

After pulling timeline is complete, it will become active and accepts
new WALs.

The current impl is a prototype. We can optimize the impl further, e.g.,
parallel pull timelines.

---------

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-14 16:37:04 +00:00
Erik Grinaker
d14d8271b8 pageserver/client_grpc: improve retry logic (#12579)
## Problem

gRPC client retries currently include pool acquisition under the
per-attempt timeout. If pool acquisition is slow (e.g. full pool), this
will cause spurious timeout warnings, and the caller will lose its place
in the pool queue.

Touches #11735.

## Summary of changes

Makes several improvements to retries and related logic:

* Don't include pool acquisition time under request timeouts.
* Move attempt timeouts out of `Retry` and into the closure.
* Make `Retry` configurable, move constants into main module.
* Don't backoff on the first retry, and reduce initial/max backoffs to
5ms and 5s respectively.
* Add `with_retries` and `with_timeout` helpers.
* Add slow logging for pool acquisition, and a `warn_slow` counterpart
to `log_slow`.
* Add debug logging for requests and responses at the client boundary.
2025-07-14 10:43:10 +00:00
Heikki Linnakangas
69dbad700c Merge remote-tracking branch 'origin/main' into HEAD 2025-07-12 16:43:57 +03:00
Erik Grinaker
0d5f4dd979 pageserver/client_grpc: improve retry logic 2025-07-12 12:41:11 +02:00
HaoyuHuang
cb991fba42 A few more PS changes (#12552)
# TLDR
Problem-I is a bug fix. The rest are no-ops. 

## Problem I
Page server checks image layer creation based on the elapsed time but
this check depends on the current logical size, which is only computed
on shard 0. Thus, for non-0 shards, the check will be ineffective and
image creation will never be done for idle tenants.

## Summary of changes I
This PR fixes the problem by simply removing the dependency on current
logical size.

## Summary of changes II
This PR adds a timeout when calling page server to split shard to make
sure SC does not wait for the API call forever. Currently the PR doesn't
adds any retry logic because it's not clear whether page server shard
split can be safely retried if the existing operation is still ongoing
or left the storage in a bad state. Thus it's better to abort the whole
operation and restart.

## Problem III
`test_remote_failures` requires PS to be compiled in the testing mode.
For PS in dev/staging, they are compiled without this mode.

## Summary of changes III
Remove the restriction and also increase the number of total failures
allowed.

## Summary of changes IV
remove test on PS getpage http route.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-11 19:27:55 +00:00
Vlad Lazar
15f633922a pageserver: use image consistent LSN for force image layer creation (#12547)
This is a no-op for the neon deployment

* Introduce the concept image consistent lsn: of the largest LSN below
which all pages have been redone successfully
* Use the image consistent LSN for forced image layer creations
* Optionally expose the image consistent LSN via the timeline describe
HTTP endpoint
* Add a sharded timeline describe endpoint to storcon

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-11 11:39:51 +00:00
Dmitrii Kovalkov
c34d36d8a2 storcon_cli: timeline-safekeeper-migrate and timeline-locate subcommands (#12548)
## Problem
We have a `safekeeper_migrate` handler, but no subcommand in
`storcon_cli`. Same for `/:timeline_id/locate` for identifying current
set of safekeepers.

- Closes: https://github.com/neondatabase/neon/issues/12395

## Summary of changes
- Add `timeline-safekeeper-migrate` and `timeline-locate` subcommands to
`storcon_cli`
2025-07-11 10:49:37 +00:00
Erik Grinaker
8cd5370c00 Merge branch 'main' into communicator-rewrite 2025-07-11 10:39:26 +02:00
Tristan Partin
1b7339b53e PG: add max_wal_rate (#12470)
## Problem
One PG tenant may write too fast and overwhelm the PS. The other tenants
sharing the same PSs will get very little bandwidth.

We had one experiment that two tenants sharing the same PSs. One tenant
runs a large ingestion that delivers hundreds of MB/s while the other
only get < 10 MB/s.

## Summary of changes
Rate limit how fast PG can generate WALs. The default is -1. We may
scale the default value with the CPU count. Need to run some experiments
to verify.

## How is this tested?
CI.

PGBench. No limit first. Then set to 1 MB/s and you can see the tps
drop. Then reverted the change and tps increased again.

pgbench -i -s 10 -p 55432 -h 127.0.0.1 -U cloud_admin -d postgres
pgbench postgres -c 10 -j 10 -T 6000000 -P 1 -b tpcb-like -h 127.0.0.1
-U cloud_admin -p 55432
progress: 33.0 s, 986.0 tps, lat 10.142 ms stddev 3.856 progress: 34.0
s, 973.0 tps, lat 10.299 ms stddev 3.857 progress: 35.0 s, 1004.0 tps,
lat 9.939 ms stddev 3.604 progress: 36.0 s, 984.0 tps, lat 10.183 ms
stddev 3.713 progress: 37.0 s, 998.0 tps, lat 10.004 ms stddev 3.668
progress: 38.0 s, 648.9 tps, lat 12.947 ms stddev 24.970 progress: 39.0
s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 40.0 s, 0.0 tps, lat
0.000 ms stddev 0.000 progress: 41.0 s, 0.0 tps, lat 0.000 ms stddev
0.000 progress: 42.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress:
43.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 44.0 s, 0.0 tps,
lat 0.000 ms stddev 0.000 progress: 45.0 s, 0.0 tps, lat 0.000 ms stddev
0.000 progress: 46.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress:
47.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 48.0 s, 0.0 tps,
lat 0.000 ms stddev 0.000 progress: 49.0 s, 347.3 tps, lat 321.560 ms
stddev 1805.633 progress: 50.0 s, 346.8 tps, lat 9.898 ms stddev 3.809
progress: 51.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 52.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 53.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 54.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 55.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 56.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 57.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 58.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 59.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 60.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 61.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 62.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 63.0 s, 494.5 tps, lat 276.504 ms stddev 1853.689 progress:
64.0 s, 488.0 tps, lat 20.530 ms stddev 71.981 progress: 65.0 s, 407.8
tps, lat 9.502 ms stddev 3.329 progress: 66.0 s, 0.0 tps, lat 0.000 ms
stddev 0.000 progress: 67.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 68.0 s, 504.5 tps, lat 71.627 ms stddev 397.733 progress: 69.0
s, 371.0 tps, lat 24.898 ms stddev 29.007 progress: 70.0 s, 541.0 tps,
lat 19.684 ms stddev 24.094 progress: 71.0 s, 342.0 tps, lat 29.542 ms
stddev 54.935

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-10 20:34:11 +00:00
Mikhail
3593fe195a split TerminationPending into two values, keeping ComputeStatus stateless (#12506)
After https://github.com/neondatabase/neon/pull/12240 we observed
issues in our go code as `ComputeStatus` is not stateless, thus doesn't
deserialize as string.

```
could not check compute activity: json: cannot unmarshal object into Go struct field
ComputeState.status of type computeclient.ComputeStatus
```

- Fix this by splitting this status into two.
- Update compute OpenApi spec to reflect changes to `/terminate` in
previous PR
2025-07-10 19:28:10 +00:00
Vlad Lazar
ffeede085e libs: move metric collection for pageserver and safekeeper in a background task (#12525)
## Problem

Safekeeper and pageserver metrics collection might time out. We've seen
this in both hadron and neon.

## Summary of changes

This PR moves metrics collection in PS/SK to the background so that we
will always get some metrics, despite there may be some delays. Will
leave it to the future work to reduce metrics collection time.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 11:58:22 +00:00
Vlad Lazar
08b19f001c pageserver: optionally force image layer creation on timeout (#12529)
This PR introduces a `image_creation_timeout` to page servers so that we
can force the image creation after a certain period. This is set to 1
day on dev/staging for now, and will rollout to production 1/2 weeks
later.

Majority of the PR are boilerplate code to add the new knob. Specific
changes of the PR are:
1. During L0 compaction, check if we should force a compaction if
min(LSN) of all delta layers < force_image_creation LSN.
2. During image creation, check if we should force a compaction if the
image's LSN < force_image_creation LSN and there are newer deltas with
overlapping key ranges.
3. Also tweaked the check image creation interval to make sure we honor
image_creation_timeout.

Vlad's note: This should be a no-op. I added an extra PS config for the
large timeline
threshold to enable this.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 10:07:21 +00:00
Vlad Lazar
fe0ddb7169 libs: make remote storage failure injection probabilistic (#12526)
Change the unreliable storage wrapper to fail by probability when there
are more failure attempts left.

Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
2025-07-09 17:41:34 +00:00
Mikhail
e7d18bc188 Replica promotion in compute_ctl (#12183)
Add `/promote` method for `compute_ctl` promoting secondary replica to
primary,
depends on secondary being prewarmed.
Add `compute-ctl` mode to `test_replica_promotes`, testing happy path
only (no corner cases yet)
Add openapi spec for `/promote` and `/lfc` handlers

https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/29807
2025-07-09 12:55:10 +00:00
Arpad Müller
7049003cf7 storcon: print viability of --timelines-onto-safekeepers (#12485)
The `--timelines-onto-safekeepers` flag is very consequential in the
sense that it controls every single timeline creation. However, we don't
have any automatic insight whether enabling the option will break things
or not.

The main way things can break is by misconfigured safekeepers, say they
are marked as paused in the storcon db. The best input so far we can
obtain via manually connecting via storcon_cli and listing safekeepers,
but this is cumbersome and manual so prone to human error.

So at storcon startup, do a simulated "test creation" in which we call
`timelines_onto_safekeepers` with the configuration provided to us, and
print whether it was successful or not. No actual timeline is created,
and nothing is written into the storcon db. The heartbeat info will not
have reached us at that point yet, but that's okay, because we still
fall back to safekeepers that don't have any heartbeat.

Also print some general scheduling policy stats on initial safekeeper
load.

Part of #11670.
2025-07-09 12:02:44 +00:00
Erik Grinaker
3915995530 pageserver/client_grpc: add rich Pageserver gRPC client (#12462)
## Problem

For the communicator, we need a rich Pageserver gRPC client.

Touches #11735.
Requires #12434.

## Summary of changes

This patch adds an initial rich Pageserver gRPC client. It supports:

* Sharded tenants across multiple Pageservers.
* Pooling of connections, clients, and streams for efficient resource
use.
* Concurrent use by many callers.
* Internal handling of GetPage bidirectional streams, with pipelining
and error handling.
* Automatic retries.
* Observability.

The client is still under development. In particular, it needs GetPage
batch splitting, shard map updates, and performance optimization. This
will be addressed in follow-up PRs.
2025-07-09 11:42:46 +00:00
Trung Dinh
4dee2bfd82 pageserver: Introduce config to enable/disable eviction task (#12496)
## Problem
We lost capability to explicitly disable the global eviction task (for
testing).

## Summary of changes
Add an `enabled` flag to `DiskUsageEvictionTaskConfig` to indicate
whether we should run the eviction job or not.
2025-07-08 21:14:04 +00:00
HaoyuHuang
3dad4698ec PS changes #1 (#12467)
# TLDR
All changes are no-op except 
1. publishing additional metrics. 
2. problem VI

## Problem I

It has come to my attention that the Neon Storage Controller doesn't
correctly update its "observed" state of tenants previously associated
with PSs that has come back up after a local data loss. It would still
think that the old tenants are still attached to page servers and won't
ask more questions. The pageserver has enough information from the
reattach request/response to tell that something is wrong, but it
doesn't do anything about it either. We need to detect this situation in
production while I work on a fix.

(I think there is just some misunderstanding about how Neon manages
their pageserver deployments which got me confused about all the
invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that will be
set to 1 if we detect a problematic situation from the reattch response.
The problematic situation is when the PS doesn't have any local tenants
but received a reattach response containing tenants.

We can set up an alert using this metric. The alert should be raised
whenever this metric reports non-zero number.

Also added a HTTP PUT
`http://pageserver/hadron-internal/reset_alert_gauges` API on the
pageserver that can be used to reset the gauge and the alert once we
manually rectify the situation (by restarting the HCC).

## Problem II
Azure upload is 3x slower than AWS. -> 3x slower ingestion. 

The reason for the slower upload is that Azure upload in page server is
much slower => higher flush latency => higher disk consistent LSN =>
higher back pressure.

## Summary of changes II
Use Azure put_block API to uploads a 1 GB layer file in 8 blocks in
parallel.

I set the put_block block size to be 128 MB by default in azure config. 

To minimize neon changes, upload function passes the layer file path to
the azure upload code through the storage metadata. This allows the
azure put block to use FileChunkStreamRead to stream read from one
partition in the file instead of loading all file data in memory and
split it into 8 128 MB chunks.

## How is this tested? II
1. rust test_real_azure tests the put_block change. 
3. I deployed the change in azure dev and saw flush latency reduces from
~30 seconds to 10 seconds.
4. I also did a bunch of stress test using sqlsmith and 100 GB TPCDS
runs.

## Problem III
Currently Neon limits the compaction tasks as 3/4 * CPU cores. This
limits the overall compaction throughput and it can easily cause
head-of-the-line blocking problems when a few large tenants are
compacting.

## Summary of changes III
This PR increases the limit of compaction tasks as `BG_TASKS_PER_THREAD`
(default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also
limits some other tasks `logical_size_calculation` and `layer eviction`
. But compaction should be the most frequent and time-consuming task.

## Summary of changes IV
This PR adds the following PageServer metrics:
1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures
the total amount of bytes evicted. It's more straightforward to see the
bytes directly instead of layers.
2. `pageserver_active_storage_operations_count`: captures the active
storage operation, e.g., flush, L0 compaction, image creation etc. It's
useful to visualize these active operations to get a better idea of what
PageServers are spending cycles on in the background.

## Summary of changes V
When investigating data corruptions, it's useful to search the base
image and all WAL records of a page up to an LSN, i.e., a breakdown of
GetPage@LSN request. This PR implements this functionality with two
tools:

1. Extended `pagectl` with a new command to search the layer files for a
given key up to a given LSN from the `index_part.json` file. The output
can be used to download the files from S3 and then search the file
contents using the second tool.
Example usage:
```
cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
```

2. Added a unit test to search the layer file contents. It's not
implemented part of `pagectl` because it depends on some test harness
code, which can only be used by unit tests.

Example usage:
```
cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
```
Example output:
```
# omitted image for brievity
delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
```

## Problem VI
Currently when page service resolves shards from page numbers, it
doesn't fully support the case that the shard could be split in the
middle. This will lead to query failures during the tenant split for
either commit or abort cases (it's mostly for abort).

## Summary of changes VI
This PR adds retry logic in `Cache::get()` to deal with shard resolution
errors more gracefully. Specifically, it'll clear the cache and retry,
instead of failing the query immediately. It also reduces the internal
timeout to make retries faster.

The PR also fixes a very obvious bug in
`TenantManager::resolve_attached_shard` where the code tries to cache
the computed the shard number, but forgot to recompute when the shard
count is different.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-08 19:43:01 +00:00
Folke Behrens
29d73e1404 http-utils: Temporarily accept duplicate params (#12504)
## Problem

Grafana Alloy in cluster mode seems to send duplicate "seconds" scrape
URL parameters
when one of its instances is disrupted.

## Summary of changes

Temporarily accept duplicate parameters as long as their value is
identical.
2025-07-08 15:49:42 +00:00
Conrad Ludgate
55aef2993d introduce a JSON serialization lib (#12417)
See #11992 and #11961 for some examples of usecases.

This introduces a JSON serialization lib, designed for more flexibility
than serde_json offers.

## Dynamic construction

Sometimes you have dynamic values you want to serialize, that are not
already in a serde-aware model like a struct or a Vec etc. To achieve
this with serde, you need to implement a lot of different traits on a
lot of different new-types. Because of this, it's often easier to
give-in and pull all the data into a serde-aware model
(serde_json::Value or some intermediate struct), but that is often not
very efficient.

This crate allows full control over the JSON encoding without needing to
implement any extra traits. Just call the relevant functions, and it
will guarantee a correctly encoded JSON value.

## Async construction

Similar to the above, sometimes the values arrive asynchronously. Often
collecting those values in memory is more expensive than writing them as
JSON, since the overheads of `Vec` and `String` is much higher, however
there are exceptions.

Serializing to JSON all in one go is also more CPU intensive and can
cause lag spikes, whereas serializing values incrementally spreads out
the CPU load and reduces lag.
2025-07-07 15:12:02 +00:00
Dmitrii Kovalkov
fc10bb9438 storage: rename term -> last_log_term in TimelineMembershipSwitchResponse (#12481)
## Problem
Names are not consistent between safekeeper migration RFC and the actual
implementation.

It's not used anywhere in production yet, so it's safe to rename. We
don't need to worry about backward compatibility.

- Follow up on https://github.com/neondatabase/neon/pull/12432

## Summary of changes
- rename term -> last_log_term in TimelineMembershipSwitchResponse 
- add missing fields to TimelineMembershipSwitchResponse in python
2025-07-07 09:22:03 +00:00
Heikki Linnakangas
e14bb4be39 Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-05 16:59:51 +03:00
Heikki Linnakangas
9a37bfdf63 Fix re-finding an entry in bucket chain 2025-07-05 00:44:46 +03:00
Heikki Linnakangas
50fbf4ac53 Fix hash table initialization across forked processes
attach_writer()/reader() are called from each forked process. It's too
late to do initialization there, in fact we used to overwrite the
contents of the hash table (or at least the freelist?) every time a
new process attached to it. The initialization must be done earlier,
in the HashMapInit() constructors.
2025-07-04 23:08:34 +03:00
Mikhail
7ed4530618 offload_lfc_interval_seconds in ComputeSpec (#12447)
- Add ComputeSpec flag `offload_lfc_interval_seconds` controlling
  whether LFC should be offloaded to endpoint storage. Default value
  (None) means "don't offload".
- Add glue code around it for `neon_local` and integration tests.
- Add `autoprewarm` mode for `test_lfc_prewarm` testing
  `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
`compute_ctl_lfc_offload_requests_total` to
`compute_ctl_lfc_prewarms_total`
  and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and
  offloads, not `compute_ctl` requests of those.
  Don't count request in metrics if there is a prewarm/offload already
  ongoing.

https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
2025-07-04 18:49:57 +00:00
Aleksandr Sarantsev
b2705cfee6 storcon: Make node deletion process cancellable (#12320)
## Problem

The current deletion operation is synchronous and blocking, which is
unsuitable for potentially long-running tasks like. In such cases, the
standard HTTP request-response pattern is not a good fit.

## Summary of Changes

- Added new `storcon_cli` commands: `NodeStartDelete` and
`NodeCancelDelete` to initiate and cancel deletion asynchronously.
- Added corresponding `storcon` HTTP handlers to support the new
start/cancel deletion flow.
- Introduced a new type of background operation: `Delete`, to track and
manage the deletion process outside the request lifecycle.

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-04 14:08:09 +00:00
Heikki Linnakangas
da3f9ee72d cargo fmt 2025-07-04 12:39:41 +03:00
Alex Chi Z.
cc699f6f85 fix(pageserver): do not log no-route-to-host errors (#12468)
## Problem

close https://github.com/neondatabase/neon/issues/12344

## Summary of changes

Add `HostUnreachable` and `NetworkUnreachable` to expected I/O error.
This was new in Rust 1.83.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-03 21:57:42 +00:00
David Freifeld
794bb7a9e8 Merge branch 'quantumish/comm-lfc-integration' into communicator-rewrite 2025-07-03 10:52:29 -07:00