8470 Commits

Author SHA1 Message Date
Stas Kelvich
015b1c7cb3 Update README (#12827)
Subj
2025-10-03 15:07:57 -07:00
dotdister
5e85c02f37 neon_local: fix mismatched comment about local SSL certificate generation (#12814)
## Problem
In control_plane, the local SSL certificate generation uses `ed25519`,
but the comment still remained `rsa:2048`, resulting in a mismatch.
This mismatch was introduced in #11542.

## Summary of changes
The comment has been corrected from `rsa:2048` to `ed25519` to ensure
consistency with the implementation.
2025-09-30 13:38:49 +02:00
Yongtao Huang
c17d3fe645 Fix typos (#12819)
Fix typo: `Falied` -> `Failed`

Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2025-09-30 13:37:00 +02:00
dotdister
4ac447c75d fix(control_plane): Fix incorrect file path of identity.toml in error message (#12826)
## Problem
Control_plane shows incorrect file path when identity.toml file open
fails.

## Summary of changes
In the error context, when writing identity.toml, I changed it to use
identity_file_path instead of config_file_path.
2025-09-30 13:35:42 +02:00
Junhyeog Lee
26b47b5beb feat: Add configurable Direct IO alignment support (#12821)
## Problem

Neon's storage system currently has hard-coded 512-byte block size for
Direct IO operations, which causes I/O errors on systems with disks that
have 4096-byte block sizes.

This results in errors like "vec read failed" and "Invalid argument (os
error 22)" on certain hardware configurations.

See issue #12623 for details.

## Summary of changes

Make Direct IO alignment configurable at build time to support both
512-byte and 4096-byte block sizes:

- Add `io-align-512` and `io-align-4k` cargo features (default: 512-byte
for backward compatibility)
- Make `DEFAULT_IO_BUFFER_ALIGNMENT` configurable via cargo features in
`pageserver_api`
- Update `DIO_CHUNK_SIZE` in vectored_dio_read to use the configured
alignment value dynamically
- Add `IO_ALIGNMENT` build argument to Dockerfile to allow building
images with different alignment settings
- Add startup logging to display the configured IO buffer alignment for
operational visibility
- Fix validation logic in `virtual_file.rs` to use the configured
alignment instead of hard-coded 512

This change allows Neon to run on systems with different disk block
sizes by building with the appropriate feature flag, addressing the
compatibility issues described in the RFC on Direct IO implementation

## Performance Note

Benchmarks show 512-byte alignment performs significantly better than
4k:
- Write: 512-byte is 21-71% faster across percentiles (p99: 71% faster)
  - Read: 512-byte is slightly faster (5-21% improvement)

This is why 512-byte remains the default.
However, some storage systems require 4k alignment and will fail with
EINVAL otherwise. This change adds build-time configuration to support
both environments.
2025-09-26 14:43:53 +01:00
John G. Crowley
85ce109361 Initial implementation of GCS provider. (#11666)
## Problem
We are currently using GCS through the AWS API instead of directly to
the GCS API.

## Summary of changes
Draft implementation of a GCS provider. We run Neon on GCS with the AWS
provider via [this
patch](https://github.com/neondatabase/neon/pull/10277), but want to use
GCS API directly. This implementation attempts to do so without adding a
GCS library dependency or new SDK, except for `gcp_auth`.
2025-09-16 10:18:25 +02:00
Peter Bendel
77e22e4bf0 remove obsolete comment - this is a dummy commit (#12816)
## Problem

we ran out of commit comment on same commit sha,
[see](https://github.com/neondatabase/neon/actions/runs/17190868211/job/48766305883#step:10:591)

## Summary of changes

Push another commit to neondatabase/neon.git to create a new commit sha
on main branch
2025-08-25 07:36:41 +00:00
Ruslan Talpa
d96cea1917 [proxy] handle options request in rest broker (cors headers) (#12744)
## Problem
rest broker needs to respond with the correct cors headers for the api
to be usable from other domains

## Summary of changes
added a code path in rest broker to handle the OPTIONS requests

---------

Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
2025-07-31 13:05:09 +00:00
Dmitrii Kovalkov
312a74f11f storcon: implement safekeeper_migrate_abort handler (#12705)
## Problem
Right now if we commit a joint configuration to DB, there is no way
back. The only way to get the clean mconf is to continue the migration.
The RFC also described an abort mechanism, which allows to abort current
migration and revert mconf change. It might be needed if the migration
is stuck and cannot have any progress, e.g. if the sk we are migrating
to went down during the migration. This PR implements this abort
algorithm.

- Closes: https://databricks.atlassian.net/browse/LKB-899
- Closes: https://github.com/neondatabase/neon/issues/12549

## Summary of changes
- Implement `safekeeper_migrate_abort` handler with the algorithm
described in RFC
- Add `timeline-safekeeper-migrate-abort` subcommand to `storcon_cli`
- Add test for the migration abort algorithm.
2025-07-31 12:40:32 +00:00
Mikhail
df4e37b7cc Report timespans for promotion and prewarm (#12730)
- Return sub-actions time spans for prewarm, prewarm offload, and
promotion in http handlers.
- Set `synchronous_standby_names=walproposer` for promoted endpoints.
Otherwise, walproposer on promoted standby ignores reply from safekeeper
and is stuck on lsn COMMIT eternally.
2025-07-31 11:51:19 +00:00
Heikki Linnakangas
b4a63e0a34 Fix how neon.stripe_size option is set in postgresql.conf file (#12776)
Commit 1dce2a9e74 changed how the `neon.pageserver_connstring` setting
is formed, but it messed up setting the `neon.stripe_size` setting so
that it was set twice. That got mixed up during development of the
patch, as commit 7fef4435c1 landed first and was merged incorrectly.
2025-07-31 11:46:57 +00:00
Erik Grinaker
f8fc0bf3c0 neon_local: use doc comments for help texts (#12270)
Clap automatically uses doc comments as help/about texts. Doc comments
are strictly better, since they're also used e.g. for IDE documentation,
and are better formatted.

This patch updates all `neon_local` commands to use doc comments
(courtesy of GPT-o3).
2025-07-31 10:25:33 +00:00
Alexey Kondratov
8fe7596120 chore(compute_tools): Delete unused anon_ext_fn_reassign.sql (#12787)
It's an anon v1 failed launch artifact, I suppose.
2025-07-31 10:11:30 +00:00
Krzysztof Szafrański
f3ee6e818d [proxy] Correctly classify ConnectErrors (#12793)
As is, e.g. quota errors on wake compute are logged as "compute" errors.
2025-07-31 09:53:48 +00:00
Dmitrii Kovalkov
edd60730c8 safekeeper: use last_log_term in mconf switch + choose most advanced sk in pull timeline (#12778)
## Problem
I discovered two bugs corresponding to safekeeper migration, which
together might lead to a data loss during the migration. The second bug
is from a hadron patch and might lead to a data loss during the
safekeeper restore in hadron as well.

1. `switch_membership` returns the current `term` instead of
`last_log_term`. It is used to choose the `sync_position` in the
algorithm, so we might choose the wrong one and break the correctness
guarantees.
2. The current `term` is used to choose the most advanced SK in
`pull_timeline` with higher priority than `flush_lsn`. It is incorrect
because the most advanced safekeeper is the one with the highest
`(last_log_term, flush_lsn)` pair. The compute might bump term on the
least advanced sk, making it the best choice to pull from, and thus
making committed log entries "uncommitted" after `pull_timeline`

Part of https://databricks.atlassian.net/browse/LKB-1017

## Summary of changes
- Return `last_log_term` in `switch_membership`
- Use `(last_log_term, flush_lsn)` as a primary key for choosing the
most advanced sk in `pull_timeline` and deny pulling if the `max_term`
is higher than on the most advanced sk (hadron only)
- Write tests for both cases
- Retry `sync_safekeepers` in `compute_ctl`
- Take into the account the quorum size when calculating `sync_position`
2025-07-31 09:29:25 +00:00
Aleksandr Sarantsev
975b95f4cd Introduce deletion API improvement RFC (#12484)
## Problem

The deletion logic had become difficult to understand and maintain.

## Summary of changes

- Added an RFC detailing proposed improvements to all deletion-related
APIs.

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-31 08:34:47 +00:00
Mikhail
01c39f378e prewarm cancellation (#12785)
Add DELETE /lfc/prewarm route which handles ongoing prewarm
cancellation, update API spec, add prewarm Cancelled state
Add offload Cancelled state when LFC is not initialized
2025-07-30 22:05:51 +00:00
Dimitri Fontaine
4d3b28bd2e [Hadron] Always run databricks auth hook. (#12683) 2025-07-30 21:34:30 +00:00
Heikki Linnakangas
81ddd10be6 tests: Don't print Hostname on every test connection (#12782)
These lines are a significant fraction of the total log size of the
regression tests. And it seems very uninteresting, it's always
'localhost' in local tests.
2025-07-30 19:56:22 +00:00
Suhas Thalanki
e470997627 enable tests introduced in hadron commits (#12790)
Enables skipped tests introduced in hadron integration commits
2025-07-30 19:10:33 +00:00
Erik Grinaker
eb2741758b storcon: actually update gRPC address on reattach (#12784)
## Problem

In #12268, we added Pageserver gRPC addresses to the storage controller.
However, we didn't actually persist these in the database.

## Summary of changes

Update the database with the new gRPC address on reattach.
2025-07-30 16:18:35 +00:00
Matthias van de Meent
f3a0e4f255 Improve specificity with which we apply compute specs (#12773)
This makes sure we don't confuse user-controlled functions with PG's
builtin functions.

## Problem

See https://github.com/neondatabase/cloud/issues/31628
2025-07-30 15:29:16 +00:00
Suhas Thalanki
842a5091d5 [BRC-3051] Walproposer: Safekeeper quorum health metrics (#930) (#12750)
Today we don't have any indications (other than spammy logs in PG that
nobody monitors) if the Walproposer in PG cannot connect to/get votes
from all Safekeepers. This means we don't have signals indicating that
the Safekeepers are operating at degraded redundancy. We need these
signals.

Added plumbing in PG extension so that the `neon_perf_counters` view
exports the following gauge metrics on safekeeper health:
- `num_configured_safekeepers`: The total number of safekeepers
configured in PG.
- `num_active_safekeepers`: The number of safekeepers that PG is
actively streaming WAL to.

An alert should be raised whenever `num_active_safekeepers` <
`num_configured_safekeepers`.

The metrics are implemented by adding additional state to the
Walproposer shared memory keeping track of the active statuses of
safekeepers using a simple array. The status of the safekeeper is set to
active (1) after the Walproposer acquires a quorum and starts streaming
data to the safekeeper, and is set to inactive (0) when the connection
with a safekeeper is shut down. We scan the safekeeper status array in
Walproposer shared memory when collecting the metrics to produce results
for the gauges.

Added coverage for the metrics to integration test
`test_wal_acceptor.py::test_timeline_disk_usage_limit`.

## Problem

## Summary of changes

---------

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-30 15:14:59 +00:00
Suhas Thalanki
056056bef0 fix(compute): validate prewarm_local_cache() input (#12648)
## Problem
```
postgres=> select neon.prewarm_local_cache('\xfcfcfcfc01000000ffffffff070000000000000000000000000000000000000000000000000000000000000000000000000000ff', 1);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
FATAL:  server conn crashed?
```

The function takes a bytea argument and casts it to a C struct, without
validating the contents.

## Summary of changes

Added validation for number of pages to be prefetched and for the chunks
as well.
2025-07-30 14:33:19 +00:00
Ruslan Talpa
e989e0da78 [proxy] accept jwts when configured as rest_broker (#12777)
## Problem

when compiled with rest_broker feature and is_rest_broker=true (but
is_auth_broker=false) accept_jwts is set to false

## Summary of changes
set the config with
```
accept_jwts: args.is_auth_broker || args.is_rest_broker
```

Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
2025-07-30 14:17:51 +00:00
Heikki Linnakangas
b3c1aecd11 tests: Stop endpoints in parallel (#12769)
Shaves off a few seconds from tests involving multiple endpoints.
2025-07-30 12:19:00 +00:00
Heikki Linnakangas
1dce2a9e74 Change how pageserver connection info is passed in compute spec (#12604)
Add a new 'pageserver_connection_info' field in the compute spec. It
replaces the old 'pageserver_connstring' field with a more complicated
struct that includes both libpq and grpc URLs, for each shard (or only
one of the the URLs, depending on the configuration). It also includes a
flag suggesting which one to use; compute_ctl now uses it to decide
which protocol to use for the basebackup.

This is backwards-compatible with everything that's in production. If
the control plane fills in `pageserver_connection_info`, compute_ctl
uses that. If it fills in the
`pageserver_connstring`/`shard_stripe_size` fields, it uses those. As
last resort, it uses the 'neon.pageserver_connstring' GUC from the list
of Postgres settings.

The 'grpc' flag in the endpoint config is now more of a suggestion, and
it's used to populate the 'prefer_protocol' flag in the compute spec.
Regardless of the flag, compute_ctl gets both URLs, so it can choose to
use libpq or grpc as it wishes. It currently always obeys the flag to
choose which method to use for getting the basebackup, but Postgres
itself will always use the libpq protocol. (That will be changed with
the new rust-based communicator project, which implements the gRPC
client in the compute).

After that, the `pageserver_connection_info.prefer_protocol` flag in the
spec file can be used to control whether compute_ctl uses grpc or libpq.
The actual compute's grpc usage will be controlled by the
`neon.enable_new_communicator` GUC (not yet; that will be introduced in
the future, with the new rust-base communicator project). It can be set
separately from 'prefer_protocol'.

Later:

- Once all old computes are gone, remove the code to pass
`neon.pageserver_connstring`
2025-07-29 22:20:05 +00:00
HaoyuHuang
ca88521653 Set neon_superuser privilege under lakebase mode (#12775)
## Problem

## Summary of changes
2025-07-29 21:30:34 +00:00
Suhas Thalanki
07c3cfd2a0 [BRC-2905] Feed back PS-detected data corruption signals to SK and PG… (#12748)
… walproposer (#895)

Data corruptions are typically detected on the pageserver side when it
replays WAL records. However, since PS doesn't synchronously replay WAL
records as they are being ingested through safekeepers, we need some
extra plumbing to feed information about pageserver-detected corruptions
during compaction (and/or WAL redo in general) back to SK and PG for
proper action.

We don't yet know what actions PG/SK should take upon receiving the
signal, but we should have the detection and feedback in place.

Add an extra `corruption_detected` field to the `PageserverFeedback`
message that is sent from PS -> SK -> PG. It's a boolean value that is
set to true when PS detects a "critical error" that signals data
corruption, and it's sent in all `PageserverFeedback` messages. Upon
receiving this signal, the safekeeper raises a
`safekeeper_ps_corruption_detected` gauge metric (value set to 1). The
safekeeper then forwards this signal to PG where a
`ps_corruption_detected` gauge metric (value also set to 1) is raised in
the `neon_perf_counters` view.

Added an integration test in
`test_compaction.py::test_ps_corruption_detection_feedback` that
confirms that the safekeeper and PG can receive the data corruption
signal in the `PageserverFeedback` message in a simulated data
corruption.

## Problem

## Summary of changes

---------

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-29 20:40:07 +00:00
Erik Grinaker
7cd0066212 page_api: add SplitError for GetPageSplitter (#12709)
Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>`
implementation. This avoids a bunch of boilerplate to convert
`GetPageSplitter` errors into `tonic::Status`.

Requires #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-29 18:26:20 +00:00
Suhas Thalanki
bf3a1529bf Report metrics on data/index corruption (#12729)
## Problem

We don't have visibility into data/index corruption.

## Summary of changes
Add data/index corruptions metrics.

PG calls elog ERROR errcode to emit these corruption errors.

PG Changes: https://github.com/neondatabase/postgres/pull/698
2025-07-29 18:08:24 +00:00
Erik Grinaker
65d1be6e90 pageserver: route gRPC requests to child shards (#12702)
## Problem

During shard splits, each parent shard is split and removed
incrementally. Only when all parent shards have split is the split
committed and the compute notified. This can take several minutes for
large tenants. In the meanwhile, the compute will be sending requests to
the (now-removed) parent shards.

This was (mostly) not a problem for the libpq protocol, because it does
shard routing on the server-side. The compute just sends requests to
some Pageserver, and the server will figure out which local shard should
serve it.

It is a problem for the gRPC protocol, where the client explicitly says
which shard it's talking to.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
Requires #12772.

## Summary of changes

* Add server-side routing of gRPC requests to any local child shards if
the parent does not exist.
* Add server-side splitting of GetPage batch requests straddling
multiple child shards.
* Move the `GetPageSplitter` into `pageserver_page_api`.

I really don't like this approach, but it avoids making changes to the
split protocol. I could be convinced we should change the split protocol
instead, e.g. to keep the parent shard alive until the split commits and
the compute has been notified, but we can also do that as a later change
without blocking the communicator on it.
2025-07-29 16:28:57 +00:00
Suhas Thalanki
16eb8dda3d some compute ctl changes from hadron (#12760)
Some compute ctl changes from hadron
2025-07-29 16:01:56 +00:00
Heikki Linnakangas
bb32f1b3d0 Move 'criterion' to a dev-dependency (#12762)
It is only used in micro-benchmarks.
2025-07-29 15:35:00 +00:00
a-masterov
5585c32cee Disable autovacuum while running pg_repack test (#12755)
## Problem
Sometimes, the regression test of `pg_repack` fails due to an extra line
in the output.
The most probable cause of this is autovacuum.  
https://databricks.atlassian.net/browse/LKB-2637
## Summary of changes
Autovacuum is disabled during the test.

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-29 15:34:02 +00:00
Krzysztof Szafrański
0ffdc98e20 [proxy] Classify "database not found" errors as user errors (#12603)
## Problem

If a user provides a wrong database name in the connection string, it
should be logged as a user error, not postgres error.

I found 4 different places where we log such errors:
1. `proxy/src/stream.rs:193`, e.g.:
```
{"timestamp":"2025-07-15T11:33:35.660026Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","msg":"database \"[redacted]\" does not exist"},"spans":{"connect_request#9":{"protocol":"tcp","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":22,"task_id":"370407867","target":"proxy::stream","src":"proxy/src/stream.rs:193","extract":{"ep":"[redacted]","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8"}}
```
2. `proxy/src/pglb/mod.rs:137`, e.g.:
```
{"timestamp":"2025-07-15T11:37:44.340497Z","level":"WARN","message":"per-client task finished with an error: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist","spans":{"connect_request#8":{"protocol":"tcp","session_id":"763baaac-d039-4f4d-9446-c149e32660eb","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":14,"task_id":"866658139","target":"proxy::pglb","src":"proxy/src/pglb/mod.rs:137","extract":{"ep":"[redacted]","session_id":"763baaac-d039-4f4d-9446-c149e32660eb"}}
```
3. `proxy/src/serverless/mod.rs:451`, e.g. (note that the error is
repeated 4 times — retries?):
```
{"timestamp":"2025-07-15T11:37:54.515891Z","level":"WARN","message":"error in websocket connection: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: db error: FATAL: database \"[redacted]\" does not exist: FATAL: database \"[redacted]\" does not exist","spans":{"http_conn#8":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828"},"connect_request#9":{"protocol":"ws","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":10,"task_id":"366331699","target":"proxy::serverless","src":"proxy/src/serverless/mod.rs:451","extract":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828","ep":"[redacted]","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7"}}
```
4. `proxy/src/serverless/sql_over_http.rs:219`, e.g.
```
{"timestamp":"2025-07-15T10:32:34.866603Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","error":"could not connect to postgres in compute","msg":"database \"[redacted]\" does not exist"},"spans":{"http_conn#19":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b"},"connect_request#5":{"protocol":"http","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":17,"task_id":"16432250","target":"proxy::serverless::sql_over_http","src":"proxy/src/serverless/sql_over_http.rs:219","extract":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b","ep":"[redacted]","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809"}}
```

This PR directly addresses 1 and 4. I _think_ it _should_ also help with
2 and 3, although in those places we don't seem to log `kind`, so I'm
not quite sure. I'm also confused why in 3 the error is repeated
multiple times.

## Summary of changes

Resolves https://github.com/neondatabase/neon/issues/9440
2025-07-29 15:25:22 +00:00
HaoyuHuang
62d844e657 Add changes in spec apply (#12759)
## Problem
All changes are no-op. 

## Summary of changes
2025-07-29 15:22:04 +00:00
Alex Chi Z.
1bb434ab74 fix(test): test_readonly_node_gc compute needs time to acquire lease (#12747)
## Problem

Part of LKB-2368. Compute fails to obtain LSN lease in this test case.
There're many assumptions around how compute obtains the leases, and in
this particular test case, as the LSN lease length is only 8s (which is
shorter than the amount of time where pageserver can restart and compute
can reconnect in terms of force stop), it sometimes cause issues.

## Summary of changes

Add more sleeps around the test case to ensure it's stable at least. We
need to find a more reliable way to test this in the future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-29 14:23:42 +00:00
Alex Chi Z.
dbde37c53a fix(safekeeper): retry if open segment fail (#12757)
## Problem

Fix LKB-2632.

The safekeeper wal read path does not seem to retry at all. This would
cause client read errors on the customer side.

## Summary of changes

- Retry on `safekeeper::wal_backup::read_object`.
- Note that this only retries on S3 HTTP connection errors. Subsequent
reads could fail, and that needs more refactors to make the retry
mechanism work across the path.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-29 14:20:43 +00:00
Heikki Linnakangas
5e3cb2ab07 Refactor LFC stats functions (#12696)
Split the functions into two parts: an internal function in file_cache.c
which returns an array of structs representing the result set, and
another function in neon.c with the glue code to expose it as a SQL
function. This is in preparation for the new communicator, which needs
to implement the same SQL functions, but getting the information from a
different place.

In the glue code, use the more modern Postgres way of building a result
set using a tuplestore.
2025-07-29 13:12:44 +00:00
Erik Grinaker
61f267d8f9 pageserver: only retry WaitForActiveTimeout during shard resolution (#12772)
## Problem

In https://github.com/neondatabase/neon/pull/12467, timeouts and retries
were added to `Cache::get` tenant shard resolution to paper over an
issue with read unavailability during shard splits. However, this
retries _all_ errors, including irrecoverable errors like `NotFound`.

This causes problems with gRPC child shard routing in #12702, which
targets specific shards with `ShardSelector::Known` and relies on prompt
`NotFound` errors to reroute requests to child shards. These retries
introduce a 1s delay for all reads during child routing.

The broader problem of read unavailability during shard splits is left
as future work, see https://databricks.atlassian.net/browse/LKB-672.

Touches #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).

## Summary of changes

* Change `TenantManager` to always return a concrete
`GetActiveTimelineError`.
* Only retry `WaitForActiveTimeout` errors.
* Lots of code unindentation due to the simplified error handling.

Out of caution, we do not gate the retries on `ShardSelector`, since
this can trigger other races. Improvements here are left as future work.
2025-07-29 12:33:02 +00:00
JC Grünhage
e2411818ef Add SBOMs and provenance attestations to container images (#12768)
## Problem
Given a container image it is difficult to figure out dependencies and
doesn't work automatically.

## Summary of changes
- Build all rust binaries with `cargo auditable`, to allow sbom scanners
to find it's dependencies.
- Adjust `attests` for `docker/build-push-action`, so that buildkit
creates sbom and provenance attestations.
- Dropping `--locked` for `rustfilt`, because `rustfilt` can't build
with locked dependencies[^5]

## Further details
Building with `cargo auditable`[^1] embeds a dependency list into Linux,
Windows, MacOS and WebAssembly artifacts. A bunch of tools support
discovering dependencies from this, among them `syft`[^2], which is used
by the BuildKit Syft scanner[^3] plugin. This BuildKit plugin is the
default[^4] used in docker for generating sbom attestations, but we're
making that default explicit by referencing the container image.
[^1]: https://github.com/rust-secure-code/cargo-auditable
[^2]: https://github.com/anchore/syft
[^3]: https://github.com/docker/buildkit-syft-scanner
[^4]:
https://docs.docker.com/build/metadata/attestations/sbom/#sbom-generator
[^5]: https://github.com/luser/rustfilt/issues/23
2025-07-29 12:12:14 +00:00
Dmitrii Kovalkov
58327cbba8 storcon: wait for the migration from the drained node in the draining loop (#12754)
## Problem
We have seen some errors in staging when the shard migration was
triggered by optimizations, and it was ongoing during draining the node
it was migrating from. It happens because the node draining loop only
waits for the migrations started by the drain loop itself. The ongoing
migrations are ignored.

Closes: https://databricks.atlassian.net/browse/LKB-1625

## Summary of changes
- Wait for the shard reconciliation during the drain if it is being
migrated from the drained node.
2025-07-29 11:58:31 +00:00
Heikki Linnakangas
568927a8a0 Remove unnecessary dependency to 'log' crate (#12763)
We use 'tracing' everywhere.
2025-07-29 11:08:22 +00:00
a-masterov
1ed7252950 Add a workaround for the clickhouse 24.9+ problem causing an error (#12767)
## Problem
We used ClickHouse v. 24.8, which is outdated, for logical replication
testing. We could miss some problems.
## Summary of changes
The version was updated to 25.6, with a workaround using the environment
variable `PGSSLCERT`.

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-29 10:19:10 +00:00
Alexander Bayandin
30b57334ef test_lsn_lease_storcon: ignore ShardSplit warning in debug builds (#12770)
## Problem

`test_lsn_lease_storcon` might fail in debug builds due to slow
ShardSplit

## Summary of changes
- Make `test_lsn_lease_storcon ` test to ignore `.*Exclusive lock by
ShardSplit was held.*` warning in debug builds

Ref: https://databricks.slack.com/archives/C09254R641L/p1753777051481029
2025-07-29 09:47:39 +00:00
Heikki Linnakangas
d487ba2b9b Replace 'memoffset' crate with core functionality (#12761)
The `std::mem::offset_of` macro was introduced in Rust 1.77.0.

In the passing, mark the function as `const`, as suggested in the
comment. Not sure which compiler version that requires, but it works
with what have currently.
2025-07-29 08:01:31 +00:00
Conrad Ludgate
e7a1d5de94 proxy: cache for password hashing (#12011)
## Problem

Password hashing for sql-over-http takes up a lot of CPU. Perhaps we can
get away with temporarily caching some steps so we only need fewer
rounds, which will save some CPU time.

## Summary of changes

The output of pbkdf2 is the XOR of the outputs of each iteration round,
eg `U1 ^ U2 ^ ... U15 ^ U16 ^ U17 ^ ... ^ Un`. We cache the suffix of
the expression `U16 ^ U17 ^ ... ^ Un`. To compute the result from the
cached suffix, we only need to compute the prefix `U1 ^ U2 ^ ... U15`.
The suffix by itself is useless, which prevent's its use in brute-force
attacks should this cached memory leak.

We are also caching the full 4096 round hash in memory, which can be
used for brute-force attacks, where this suffix could be used to speed
it up. My hope/expectation is that since these will be in different
allocations, it makes any such memory exploitation much much harder.
Since the full hash cache might be invalidated while the suffix is
cached, I'm storing the timestamp of the computation as a way to
identity the match.

I also added `zeroize()` to clear the sensitive state from the
stack/heap.

For the most security conscious customers, we hope to roll out OIDC
soon, so they can disable passwords entirely.

---

The numbers for the threadpool were pretty random, but according to our
busiest region for sql-over-http, we only see about 150 unique endpoints
every minute. So storing ~100 of the most common endpoints for that
minute should be the vast majority of requests.

1 minute was chosen so we don't keep data in memory for too long.
2025-07-29 06:48:14 +00:00
Ivan Efremov
6be572177c chore: Fix nightly lints (#12746)
- Remove some unused code
- Use `is_multiple_of()` instead of '%'
- Collapse consecuative "if let" statements
- Elided lifetime fixes

It is enough just to review the code of your team
2025-07-28 21:36:30 +00:00
Alex Chi Z.
fe7a4e1ab6 fix(test): wait compaction in timeline offload test (#12673)
## Problem

close LKB-753. `test_pageserver_metrics_removed_after_offload` is
unstable and it sometimes leave the metrics behind after tenant
offloading. It turns out that we triggered an image compaction before
the offload and the job was stopped after the offload request was
completed.

## Summary of changes

Wait all background tasks to finish before checking the metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-28 16:27:55 +00:00