Commit Graph

8282 Commits

Author SHA1 Message Date
Victor Polevoy
e191daa152 Add a test for generating pprof from collapsed 2025-08-21 12:40:57 +02:00
Victor Polevoy
7572ffc725 Document the continuous profiling of computes 2025-08-21 11:53:34 +02:00
Victor Polevoy
0aacbc2583 Install bpfcc tools in the compute node docker image.
This provides the binaries and libraries required for the continuous
profiling at runtime.
2025-08-21 11:16:00 +02:00
Victor Polevoy
cb9874dc4e After switching to ext4 img, symlinking is possible 2025-08-20 16:24:07 +02:00
Victor Polevoy
891c1fe512 Fix the kernel headers enabling 2025-08-02 14:03:25 +02:00
Victor Polevoy
9cae494555 TRY TRY TRY (pink) 2025-07-31 13:02:45 +02:00
Victor Polevoy
c3e6d360b5 Clarify the reason for postgres comment 2025-07-18 12:52:57 +02:00
Victor Polevoy
9e69e24a52 Use the supplied kernel headers and modules 2025-07-18 12:52:45 +02:00
Victor Polevoy
8dbf5a8c5b Add annotation to skip on CI and macos 2025-07-15 12:30:34 +02:00
Victor Polevoy
463429af97 build-tools 2025-07-11 16:20:04 +02:00
Victor Polevoy
d8b0c0834e Implement HTTP endpoint for compute profiling.
Exposes an endpoint "/profile/cpu" for profiling the postgres
processes (currently spawned and the new ones) using "perf".

Adds the corresponding python test to test the added endpoint
and confirm the output expected is the profiling data in the
expected format.

Add "perf" binary to the sudo list.

Fix python poetry ruff

Address the clippy lints

Document the code

Format python code

Address code review

Prettify

Embed profile_pb2.py and small code/test fixes.

Make the code slightly better.

1. Makes optional the sampling_frequency parameter for profiling.
2. Avoids using unsafe code when killing a child.

Better code, better tests

More tests

Separate start and stop of profiling

Correctly check for the exceptions

Address clippy lint

Final fixes.

1. Allows the perf to be found in $PATH instead of having the path
hardcoded.
2. Changes the path to perf in the sudoers file so that the compute
can run it properly.
3. Changes the way perf is invoked, now it is with sudo and the path
from $PATH.
4. Removes the authentication requirement from the /profile/cpu/
endpoint.

hakari thing

Python fixes

Fix python formatting

More python fixes

Update poetry lock

Fix ruff

Address the review comments

Fix the tests

Try fixing the flaky test for pg17?

Try fixing the flaky test for pg17?

PYTHON

Fix the tests

Remove the PROGRESS parameter

Remove unused

Increase the timeout due to concurrency

Increase the timeout to 60

Increase the profiling window timeout

Try this

Lets see the error

Just log all the errors

Add perf into the build environment

uijdfghjdf

Update tempfile to 3.20

Snapshot

Use bbc-profile

Update tempfile to 3.20

Provide bpfcc-tools in debian

Properly respond with status

Python check

Fix build-tools dockerfile

Add path probation for the bcc profile

Try err printing

Refactor

Add bpfcc-tools to the final image

Add error context

sudo not found?

Print more errors for verbosity

Remove procfs and use libproc

Update hakari

Debug sudo in CI

Rebase and adjust hakari

remove leftover

Add archiving support

Correct the paths to the perf binary

Try hardcoded sudo path

Add sudo into build-tools dockerfile

Minor cleanup

Print out the sudoers file from github

Stop the tests earlier

Add the sudoers entry for nonroot, install kmod for modprobe for bcc-profile

Try hacking the kernel headers for bcc-profile

Redeclare the kernel version argument

Try using the kernel of the runner

Try another way

Check bpfcc-tools
2025-07-11 12:53:48 +02:00
Tristan Partin
cec0543b51 Add background to compute migration 0002-alter_roles.sql (#11708)
On December 8th, 2023, an engineering escalation (INC-110) was opened
after it was found that BYPASSRLS was being applied to all roles.

PR that introduced the issue:
https://github.com/neondatabase/neon/pull/5657
Subsequent commit on main:
ad99fa5f03

NOBYPASSRLS and INHERIT are the defaults for a Postgres role, but
because it isn't easy to know if a Postgres cluster is affected by the
issue, we need to keep the migration around for a long time, if not
indefinitely, so any cluster can be fixed.

Branching is the gift that keeps on giving...

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-10 22:58:54 +00:00
Erik Grinaker
8aa9540a05 pageserver/page_api: include block number and rel in gRPC GetPageResponse (#12542)
## Problem

With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places of the stack
(concurrent reads, shard splits, LFC hits, etc). If we included the
block numbers with the pages in `GetPageResponse` we could have better
verification and observability that the final responses are correct.

Touches #11735.
Requires #12480.

## Summary of changes

Add a `Page` struct with`block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
2025-07-10 22:35:14 +00:00
Alex Chi Z.
b91f821e8b fix(libpagestore): update the default stripe size (#12557)
## Problem

Part of LKB-379

The pageserver connstrings are updated in the postmaster and then
there's a hook to propagate it to the shared memory of all backends.
However, the shard stripe doesn't. This would cause problems during
shard splits:

* the compute has active reads/writes
* shard split happens and the cplane applies the new config (pageserver
connstring + stripe size)
* pageserver connstring will be updated immediately once the postmaster
receives the SIGHUP, and it will be copied over the the shared memory of
all other backends.
* stripe size is a normal GUC and we don't have special handling around
that, so if any active backend has ongoing txns the value won't be
applied.
* now it's possible for backends to issue requests based on the wrong
stripe size; what's worse, if a request gets cached in the prefetch
buffer, it will get stuck forever.

## Summary of changes

To make sure it aligns with the current default in storcon.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-10 21:49:52 +00:00
Erik Grinaker
44ea17b7b2 pageserver/page_api: add attempt to GetPage request ID (#12536)
## Problem

`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.

Touches #11735.
Requires #12480.

## Summary of changes

Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
2025-07-10 20:39:42 +00:00
Tristan Partin
1b7339b53e PG: add max_wal_rate (#12470)
## Problem
One PG tenant may write too fast and overwhelm the PS. The other tenants
sharing the same PSs will get very little bandwidth.

We had one experiment that two tenants sharing the same PSs. One tenant
runs a large ingestion that delivers hundreds of MB/s while the other
only get < 10 MB/s.

## Summary of changes
Rate limit how fast PG can generate WALs. The default is -1. We may
scale the default value with the CPU count. Need to run some experiments
to verify.

## How is this tested?
CI.

PGBench. No limit first. Then set to 1 MB/s and you can see the tps
drop. Then reverted the change and tps increased again.

pgbench -i -s 10 -p 55432 -h 127.0.0.1 -U cloud_admin -d postgres
pgbench postgres -c 10 -j 10 -T 6000000 -P 1 -b tpcb-like -h 127.0.0.1
-U cloud_admin -p 55432
progress: 33.0 s, 986.0 tps, lat 10.142 ms stddev 3.856 progress: 34.0
s, 973.0 tps, lat 10.299 ms stddev 3.857 progress: 35.0 s, 1004.0 tps,
lat 9.939 ms stddev 3.604 progress: 36.0 s, 984.0 tps, lat 10.183 ms
stddev 3.713 progress: 37.0 s, 998.0 tps, lat 10.004 ms stddev 3.668
progress: 38.0 s, 648.9 tps, lat 12.947 ms stddev 24.970 progress: 39.0
s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 40.0 s, 0.0 tps, lat
0.000 ms stddev 0.000 progress: 41.0 s, 0.0 tps, lat 0.000 ms stddev
0.000 progress: 42.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress:
43.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 44.0 s, 0.0 tps,
lat 0.000 ms stddev 0.000 progress: 45.0 s, 0.0 tps, lat 0.000 ms stddev
0.000 progress: 46.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress:
47.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 48.0 s, 0.0 tps,
lat 0.000 ms stddev 0.000 progress: 49.0 s, 347.3 tps, lat 321.560 ms
stddev 1805.633 progress: 50.0 s, 346.8 tps, lat 9.898 ms stddev 3.809
progress: 51.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 52.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 53.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 54.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 55.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 56.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 57.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 58.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 59.0 s, 0.0 tps, lat 0.000 ms stddev 0.000 progress: 60.0 s,
0.0 tps, lat 0.000 ms stddev 0.000 progress: 61.0 s, 0.0 tps, lat 0.000
ms stddev 0.000 progress: 62.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 63.0 s, 494.5 tps, lat 276.504 ms stddev 1853.689 progress:
64.0 s, 488.0 tps, lat 20.530 ms stddev 71.981 progress: 65.0 s, 407.8
tps, lat 9.502 ms stddev 3.329 progress: 66.0 s, 0.0 tps, lat 0.000 ms
stddev 0.000 progress: 67.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 68.0 s, 504.5 tps, lat 71.627 ms stddev 397.733 progress: 69.0
s, 371.0 tps, lat 24.898 ms stddev 29.007 progress: 70.0 s, 541.0 tps,
lat 19.684 ms stddev 24.094 progress: 71.0 s, 342.0 tps, lat 29.542 ms
stddev 54.935

Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
2025-07-10 20:34:11 +00:00
Mikhail
3593fe195a split TerminationPending into two values, keeping ComputeStatus stateless (#12506)
After https://github.com/neondatabase/neon/pull/12240 we observed
issues in our go code as `ComputeStatus` is not stateless, thus doesn't
deserialize as string.

```
could not check compute activity: json: cannot unmarshal object into Go struct field
ComputeState.status of type computeclient.ComputeStatus
```

- Fix this by splitting this status into two.
- Update compute OpenApi spec to reflect changes to `/terminate` in
previous PR
2025-07-10 19:28:10 +00:00
Mikhail
c5aaf1ae21 Qualify call to neon extension in compute_ctl's prewarming (#12554)
https://github.com/neondatabase/cloud/issues/19011
Calls without `neon.` failed on staging.
Also fix local tests to work with qualified calls
2025-07-10 18:37:54 +00:00
Alex Chi Z.
13b5e7b26f fix(compute_ctl): reload config before applying spec (#12551)
## Problem

If we have catalog update AND a pageserver migration batched in a single
spec, we will not be able to apply the spec (running the SQL) because
the compute is not attached to the right pageserver and we are not able
to read anything if we don't pick up the latest pageserver connstring.
This is not a case for now because cplane always schedules shard split /
pageserver migrations with `skip_pg_catalog_updates` (I suppose).

Context:
https://databricks.slack.com/archives/C09254R641L/p1752163559259399?thread_ts=1752160163.141149&cid=C09254R641L

With this fix, backpressure will likely not be able to affect
reconfigurations.

## Summary of changes

Do `pg_reload_conf` before we apply specs in SQL.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-10 18:02:54 +00:00
Erik Grinaker
dcdfe80bf0 pagebench: add support for rich gRPC client (#12477)
## Problem

We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.

Touches #11735.
Requires #12476.

## Summary of changes

Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also adds a compression parameter to
the client.
2025-07-10 17:30:09 +00:00
Alexander Bayandin
8630d37f5e test_runner: manually reuse ports in PortDistributor (#12423)
## Problem

Sometimes we run out of free ports in `PortDistributor`. This affects
particularly failed tests that we rerun automatically up to 3 times
(which makes it use up to 3x more ports)

## Summary of changes
- Cycle over the range of ports to reuse freed ports from previous tests

Ref: LKB-62
2025-07-10 15:53:38 +00:00
Erik Grinaker
2fc77c836b pageserver/client_grpc: add shard map updates (#12480)
## Problem

The communicator gRPC client must support changing the shard map on
splits.

Touches #11735.
Requires #12476.

## Summary of changes

* Wrap the shard set in a `ArcSwap` to allow swapping it out.
* Add a new `ShardSpec` parameter struct to pass validated shard info to
the client.
* Add `update_shards()` to change the shard set. In-flight requests are
allowed to complete using the old shards.
* Restructure `get_page` to use a stable view of the shard map, and
retry errors at the top (pre-split) level to pick up shard map changes.
* Also marks `tonic::Status::Internal` as non-retryable, so that we can
use it for client-side invariant checks without continually retrying
these.
2025-07-10 15:46:39 +00:00
HaoyuHuang
2c6b327be6 A few PS changes (#12540)
# TLDR
All changes are no-op except some metrics. 

## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors

Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC

Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)    


## Summary of changes II
This PR adds a pageserver endpoint and its counterpart in storage
controller to list visible size of all tenant shards. This will be a
prerequisite of the tenant rebalance command.


## Problem III
We need a way to download WAL
segments/layerfiles from S3 and replay WAL records. We cannot access
production S3 from our laptops directly, and we also can't transfer any
user data out of production systems for GDPR compliance, so we need
solutions.

## Summary of changes III

This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.

## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful to debug compaction issues as index_part
often contain a lot of covered layers due to PITR.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
2025-07-10 14:39:38 +00:00
Alex Chi Z.
be5bbaecad fix(storcon): correctly handle 404 error in lsn lease (#12537)
## Problem

close LKB-253

## Summary of changes

404 for timeline requests could happen when the tenant is intended to be
on a pageserver but not attached yet. This patch adds handling for the
lease request. In the future, we should extend this handling to more
operations.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-10 14:28:58 +00:00
Arpad Müller
d33b3c7457 Print viability via custom printing impl (#12544)
As per
https://github.com/neondatabase/neon/pull/12485#issuecomment-3056525882
,

we don't want to print the viability error via a debug impl as it prints
the backtrace. SafekeeperInfo doesn't have a display impl, so fall back
to `Debug` for the `Ok` case. It gives single line output so it's okay
to use `Debug` for it.

Follow up of https://github.com/neondatabase/neon/pull/12485
2025-07-10 14:03:20 +00:00
Vlad Lazar
ffeede085e libs: move metric collection for pageserver and safekeeper in a background task (#12525)
## Problem

Safekeeper and pageserver metrics collection might time out. We've seen
this in both hadron and neon.

## Summary of changes

This PR moves metrics collection in PS/SK to the background so that we
will always get some metrics, despite there may be some delays. Will
leave it to the future work to reduce metrics collection time.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 11:58:22 +00:00
Mikhail
bdca5b500b Fix test_lfc_prewarm: reduce number of prewarms, sleep before LFC offloading (#12515)
Fixes:
- Sleep before LFC offloading in `test_lfc_prewarm[autoprewarm]` to
ensure offloaded LFC is the one exported after all writes finish
- Reduce number of prewarms and increase timeout in
`test_lfc_prewarm_under_workload` as debug builds were failing due to
timeout.

Additional changes:
- Remove `check_pinned_entries`:
https://github.com/neondatabase/neon/pull/12447#discussion_r2185946210
- Fix LFC error metrics description:
https://github.com/neondatabase/neon/pull/12486#discussion_r2190763107
2025-07-10 11:11:53 +00:00
Erik Grinaker
f4b03ddd7b pageserver/client_grpc: reap idle pool resources (#12476)
## Problem

The gRPC client pools don't reap idle resources.

Touches #11735.
Requires #12475.

## Summary of changes

Reap idle pool resources (channels/clients/streams) after 3 minutes of
inactivity.

Also restructure the `StreamPool` to use a mutex rather than atomics for
synchronization, for simplicity. This will be optimized later.
2025-07-10 10:18:37 +00:00
Vlad Lazar
08b19f001c pageserver: optionally force image layer creation on timeout (#12529)
This PR introduces a `image_creation_timeout` to page servers so that we
can force the image creation after a certain period. This is set to 1
day on dev/staging for now, and will rollout to production 1/2 weeks
later.

Majority of the PR are boilerplate code to add the new knob. Specific
changes of the PR are:
1. During L0 compaction, check if we should force a compaction if
min(LSN) of all delta layers < force_image_creation LSN.
2. During image creation, check if we should force a compaction if the
image's LSN < force_image_creation LSN and there are newer deltas with
overlapping key ranges.
3. Also tweaked the check image creation interval to make sure we honor
image_creation_timeout.

Vlad's note: This should be a no-op. I added an extra PS config for the
large timeline
threshold to enable this.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-10 10:07:21 +00:00
Dimitri Fontaine
1a45b2ec90 Review security model for executing Event Trigger code. (#12463)
When a function is owned by a superuser (bootstrap user or otherwise),
we consider it safe to run it. Only a superuser could have installed it,
typically from CREATE EXTENSION script: we trust the code to execute.

## Problem

This is intended to solve running pg_graphql Event Triggers
graphql_watch_ddl and graphql_watch_drop which are executing the secdef
function graphql.increment_schema_version().

## Summary of changes

Allow executing Event Trigger function owned by a superuser and with
SECURITY DEFINER properties. The Event Trigger code runs with superuser
privileges, and we consider that it's fine.

---------

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-10 08:06:33 +00:00
Tristan Partin
13e38a58a1 Grant pg_signal_backend to neon_superuser (#12533)
Allow neon_superuser to cancel backends from non-neon_superusers,
excluding Postgres superusers.

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
Co-authored-by: Vikas Jain <vikas.jain@databricks.com>
2025-07-09 21:35:39 +00:00
Christian Schwarz
2edd59aefb impr(compaction): unify checking of CompactionError for cancellation reason (#12392)
There are a couple of places that call `CompactionError::is_cancel` but
don't check the `::Other` variant via downcasting for root cause being
cancellation.
The only place that does it is `log_compaction_error`.
It's sad we have to do it, but, until we get around cleaning up all the
culprits,
a step forward is to unify the behavior so that all places that inspect
a
`CompactionError` for cancellation reason follow the same behavior.

Thus, this PR ...
- moves the downcasting checks against the `::Other` variant from
  `log_compaction_error` into `is_cancel()` and
- enforces via type system that `.is_cancel()` is used to check whether
  a CompactionError is due to cancellation (matching on the
  `CompactionError::ShuttingDown` will cause a compile-time error).

I don't think there's a _serious_ case right now where matching instead
of using `is_cancel` causes problems.
The worst case I could find is the circuit breaker and
`compaction_failed`,
which don't really matter if we're shutting down the timeline anyway.
But it's unaesthetic and might cause log/alert noise down the line,
so, this PR fixes that at least.

Refs
- https://databricks.atlassian.net/browse/LKB-182
- slack conversation about this PR:
https://databricks.slack.com/archives/C09254R641L/p1751284317955159
2025-07-09 21:15:44 +00:00
Alex Chi Z.
0b639ba608 fix(storcon): correctly pass through lease error code (#12519)
## Problem

close LKB-199

## Summary of changes

We always return the error as 500 to the cplane if a LSN lease request
fails. This cause issues for the cplane as they don't retry on 500. This
patch correctly passes through the error and assign the error code so
that cplane can know if it is a retryable error. (TODO: look at the
cplane code and learn the retry logic).

Note that this patch does not resolve LKB-253 -- we need to handle not
found error separately in the lsn lease path, like wait until the tenant
gets attached, or return 503 so that cplane can retry.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-09 20:22:55 +00:00
Tristan Partin
28f604d628 Make pg_monitor neon_superuser test more robust (#12532)
Make sure to check for NULL just in case.

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
Co-authored-by: Vikas Jain <vikas.jain@databricks.com>
2025-07-09 18:45:50 +00:00
Vlad Lazar
fe0ddb7169 libs: make remote storage failure injection probabilistic (#12526)
Change the unreliable storage wrapper to fail by probability when there
are more failure attempts left.

Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>
2025-07-09 17:41:34 +00:00
Dmitrii Kovalkov
4bbabc092a tests: wait for flush lsn in test_branch_creation_before_gc (#12527)
## Problem
Test `test_branch_creation_before_gc` is flaky in the internal repo.
Pageserver sometimes lags behind write LSN. When we call GC it might not
reach the LSN we try to create the branch at yet.

## Summary of changes
- Wait till flush lsn on pageserver reaches the latest LSN before
calling GC.
2025-07-09 17:16:06 +00:00
Tristan Partin
12c26243fc Fix typo in migration testing related to pg_monitor (#12530)
We should be joining on the neon_superuser roleid, not the pg_monitor
roleid.

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-09 16:47:21 +00:00
Erik Grinaker
2f71eda00f pageserver/client_grpc: add separate pools for bulk requests (#12475)
## Problem

GetPage bulk requests such as prefetches and vacuum can head-of-line
block foreground requests, causing increased latency.

Touches #11735.
Requires #12469.

## Summary of changes

* Use dedicated channel/client/stream pools for bulk GetPage requests.
* Use lower concurrency but higher queue depth for bulk pools.
* Make pool limits configurable.
* Require unbounded client pool for stream pool, to avoid accidental
starvation.
2025-07-09 16:12:59 +00:00
Alex Chi Z.
5ec82105cc fix(pageserver): ensure remote size gets computed (#12520)
## Problem

Follow up of #12400 

## Summary of changes

We didn't set remote_size_mb to Some when initialized so it never gets
computed :(

Also added a new API to force refresh the properties.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-09 15:35:19 +00:00
a-masterov
78a6daa874 Add retrying in Random ops test if parent branch is not found. (#12513)
## Problem
Due to a lag in replication, we sometimes cannot get the parent branch
definition just after completion of the Public API restore call. This
leads to the test failures.
https://databricks.atlassian.net/browse/LKB-279
## Summary of changes
The workaround is implemented. Now test retries up to 30 seconds,
waiting for the branch definition to appear.

---------

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-09 15:28:04 +00:00
Alexander Lakhin
5c0de4ee8c Fix parameter name in workload for test_multiple_subscription_branching (#12522)
## Problem

As discovered in https://github.com/neondatabase/neon/issues/12394,
test_multiple_subscription_branching generates skewed data distribution,
that leads to test failures when the unevenly filled last table receives
even more data.
for table t0: pub_res = (42001,), sub_res = (42001,)
for table t1: pub_res = (29001,), sub_res = (29001,)
for table t2: pub_res = (21001,), sub_res = (21001,)
for table t3: pub_res = (21001,), sub_res = (21001,)
for table t4: pub_res = (1711001,), sub_res = (1711001,)
 
## Summary of changes
Fix the name of the workload parameter to generate data as expected.
2025-07-09 15:22:54 +00:00
Mikhail
bc6a756f1c ci: lint openapi specs using redocly (#12510)
We need to lint specs for pageserver, endpoint storage, and safekeeper
#0000
2025-07-09 14:29:45 +00:00
Erik Grinaker
8f3351fa91 pageserver/client_grpc: split GetPage batches across shards (#12469)
## Problem

The rich gRPC Pageserver client needs to split GetPage batches that
straddle multiple shards.

Touches #11735.
Requires #12462.

## Summary of changes

Adds a `GetPageSplitter` which splits `GetPageRequest` that span
multiple shards, and then reassembles the responses. Dispatches
per-shard requests in parallel.
2025-07-09 14:17:22 +00:00
Mikhail
e7d18bc188 Replica promotion in compute_ctl (#12183)
Add `/promote` method for `compute_ctl` promoting secondary replica to
primary,
depends on secondary being prewarmed.
Add `compute-ctl` mode to `test_replica_promotes`, testing happy path
only (no corner cases yet)
Add openapi spec for `/promote` and `/lfc` handlers

https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/29807
2025-07-09 12:55:10 +00:00
Konstantin Knizhnik
4ee0da0a20 Check prefetch response before assignment to slot (#12371)
## Problem

See [Slack
Channel](https://databricks.enterprise.slack.com/archives/C091LHU6NNB)

Dropping connection without resetting prefetch state can cause
request/response mismatch.
And lack of check response correctness in communicator_prefetch_lookupv
can cause data corruption.

## Summary of changes

1. Validate response before assignment to prefetch slot.
2. Consume prefetch requests before sending any other requests.

---------

Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-07-09 12:49:21 +00:00
Arpad Müller
7049003cf7 storcon: print viability of --timelines-onto-safekeepers (#12485)
The `--timelines-onto-safekeepers` flag is very consequential in the
sense that it controls every single timeline creation. However, we don't
have any automatic insight whether enabling the option will break things
or not.

The main way things can break is by misconfigured safekeepers, say they
are marked as paused in the storcon db. The best input so far we can
obtain via manually connecting via storcon_cli and listing safekeepers,
but this is cumbersome and manual so prone to human error.

So at storcon startup, do a simulated "test creation" in which we call
`timelines_onto_safekeepers` with the configuration provided to us, and
print whether it was successful or not. No actual timeline is created,
and nothing is written into the storcon db. The heartbeat info will not
have reached us at that point yet, but that's okay, because we still
fall back to safekeepers that don't have any heartbeat.

Also print some general scheduling policy stats on initial safekeeper
load.

Part of #11670.
2025-07-09 12:02:44 +00:00
Erik Grinaker
3915995530 pageserver/client_grpc: add rich Pageserver gRPC client (#12462)
## Problem

For the communicator, we need a rich Pageserver gRPC client.

Touches #11735.
Requires #12434.

## Summary of changes

This patch adds an initial rich Pageserver gRPC client. It supports:

* Sharded tenants across multiple Pageservers.
* Pooling of connections, clients, and streams for efficient resource
use.
* Concurrent use by many callers.
* Internal handling of GetPage bidirectional streams, with pipelining
and error handling.
* Automatic retries.
* Observability.

The client is still under development. In particular, it needs GetPage
batch splitting, shard map updates, and performance optimization. This
will be addressed in follow-up PRs.
2025-07-09 11:42:46 +00:00
Folke Behrens
5ea0bb2d4f proxy: Drop unused metrics (#12521)
* proxy_control_plane_token_acquire_seconds
* proxy_allowed_ips_cache_misses
* proxy_vpc_endpoint_id_cache_stats
* proxy_access_blocker_flags_cache_stats
* proxy_requests_auth_rate_limits_total
* proxy_endpoints_auth_rate_limits
* proxy_invalid_endpoints_total
2025-07-09 09:58:46 +00:00
Christian Schwarz
aac1f8efb1 refactor(compaction): eliminate CompactionError::CollectKeyspaceError variant (#12517)
The only differentiated handling of it is for `is_critical`, which in
turn is a `matches!()` on several variants of the `enum
CollectKeyspaceError`
which is the value contained insided
`CompactionError::CollectKeyspaceError`.

This PR introduces a new error for `repartition()`, allowing its
immediate
callers to inspect it like `is_critical` did.

A drive-by fix is more precise classification of WaitLsnError::BadState
when mapping to `tonic::Status`.

refs
- https://databricks.atlassian.net/browse/LKB-182
2025-07-09 08:41:36 +00:00
Alex Chi Z.
43dbded8c8 fix(pageserver): disallow lease creation below the applied gc cutoff (#12489)
## Problem

close LKB-209

## Summary of changes

- We should not allow lease creation below the applied gc cutoff.
- Also removed the condition for `AttachedSingle`. We should always
check the lease against the gc cutoff in all attach modes.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-08 22:32:51 +00:00