Commit Graph

8508 Commits

Author SHA1 Message Date
Christian Schwarz
7cbbe99731 Merge commit 'f3ee6e818' into problame/standby-horizon-leases 2025-08-06 18:43:40 +02:00
Christian Schwarz
a807371955 MERGE WITH CONFLICTS commit '1dce2a9e746edf7b93ce1048ebf63bf5c1395c18' into problame/standby-horizon-leases
Heikki implemented a better representation of pageserver conn info which
obsoletes my refactoring attempt. Use it.
2025-08-06 18:39:29 +02:00
Christian Schwarz
d888835555 Merge commit 'ca8852165' into problame/standby-horizon-leases 2025-08-06 18:01:04 +02:00
Christian Schwarz
668e82d713 Merge commit '07c3cfd2a' into problame/standby-horizon-leases 2025-08-06 18:00:59 +02:00
Christian Schwarz
e0d2d293b1 Merge commit '7cd006621' into problame/standby-horizon-leases 2025-08-06 18:00:38 +02:00
Christian Schwarz
8799e87ae3 Merge commit '62d844e65' into problame/standby-horizon-leases 2025-08-06 18:00:09 +02:00
Christian Schwarz
737f5825bb Merge commit 'b623fbae0' into problame/standby-horizon-leases 2025-08-06 17:59:58 +02:00
Christian Schwarz
b95106cd79 Merge commit '5c57e8a11' into problame/standby-horizon-leases 2025-08-06 17:59:21 +02:00
Christian Schwarz
be1c1df6aa Merge commit '84a2556c9' into problame/standby-horizon-leases 2025-08-06 17:58:54 +02:00
Christian Schwarz
7d28fb118b Merge commit 'f85935446' into problame/standby-horizon-leases 2025-08-06 17:58:36 +02:00
Christian Schwarz
daf2b5a806 Merge commit 'b00a0096b' into problame/standby-horizon-leases 2025-08-06 17:56:37 +02:00
Christian Schwarz
e52d0ef311 Merge commit '5b0972151' into problame/standby-horizon-leases 2025-08-06 17:56:07 +02:00
Christian Schwarz
d22e23f66d Merge commit '108f7ec54' into problame/standby-horizon-leases 2025-08-06 17:55:56 +02:00
Christian Schwarz
54480167dc Merge commit '9c0efba91' into problame/standby-horizon-leases 2025-08-06 17:55:48 +02:00
Christian Schwarz
30e7c4b75d Merge commit '187170be4' into problame/standby-horizon-leases 2025-08-06 17:55:39 +02:00
Christian Schwarz
d380111428 Merge commit '87915df2f' into problame/standby-horizon-leases 2025-08-06 17:55:06 +02:00
Christian Schwarz
78a8ac7be9 ruff format 2025-08-06 17:54:36 +02:00
Christian Schwarz
279865c68a Merge commit 'dd7fff655' into problame/standby-horizon-leases 2025-08-06 17:54:17 +02:00
Christian Schwarz
1ace4bcf23 Merge commit '809633903' into problame/standby-horizon-leases 2025-08-06 17:50:43 +02:00
Christian Schwarz
35c916c062 Merge commit '5c934efb2' into problame/standby-horizon-leases 2025-08-06 17:50:33 +02:00
Christian Schwarz
02e1aeef66 Merge commit 'a456e818a' into problame/standby-horizon-leases 2025-08-06 17:49:56 +02:00
Christian Schwarz
e2c88c1929 Merge commit '296c9190b' into problame/standby-horizon-leases 2025-08-06 17:49:50 +02:00
Christian Schwarz
553a120075 Merge commit '15f633922' into problame/standby-horizon-leases 2025-08-06 17:49:41 +02:00
Christian Schwarz
cfe345d3e6 Merge commit 'c34d36d8a' into problame/standby-horizon-leases 2025-08-06 17:47:29 +02:00
Christian Schwarz
e2facbde4e Merge commit 'cec0543b5' into problame/standby-horizon-leases 2025-08-06 17:47:10 +02:00
Christian Schwarz
b8c8168378 Merge commit 'be5bbaeca' into problame/standby-horizon-leases 2025-08-06 17:46:44 +02:00
Christian Schwarz
28a2cd05d5 Merge commit '5ec82105c' into problame/standby-horizon-leases 2025-08-06 17:46:37 +02:00
Christian Schwarz
1635390a96 fix all clippy complaints in this branch 2025-08-06 17:39:17 +02:00
Christian Schwarz
1877b70a35 Merge commit 'e7d18bc18' into problame/standby-horizon-leases 2025-08-06 17:19:37 +02:00
Christian Schwarz
fb7a027211 Merge commit '4ee0da0a2' into problame/standby-horizon-leases 2025-08-06 17:17:45 +02:00
Christian Schwarz
47146fe1d6 Merge commit '7049003cf' into problame/standby-horizon-leases 2025-08-06 17:17:11 +02:00
Christian Schwarz
577eee16f9 https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343; concern about backward compat of TimelineInfo 2025-08-05 23:07:26 +02:00
Christian Schwarz
2ee0f4271c fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field
The LSN lease api actually accepts a tenant_shard_id, not a tenant_id.
But we put the Display of the tenant_shard_id into the tenant_id field.
This PR fixes it.

Refs
- fixes https://databricks.atlassian.net/browse/LKB-2930
2025-08-05 22:48:27 +02:00
Christian Schwarz
8a9f1dd5e7 use tokio::time::Instant internally, chrono::DateTime<Utc> externally; communicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism 2025-08-05 22:47:53 +02:00
Christian Schwarz
9f01840c18 use standby_horizon leases feature in the test, demonstrating that it passes now 2025-08-05 22:47:28 +02:00
Christian Schwarz
44466cebdb WIP better observability for return values (SystemTime Debug is useless) 2025-08-05 22:46:54 +02:00
Christian Schwarz
b865e85de3 previous commit broke the tests because of the cfg business, see this commit's TODO 2025-08-05 22:46:24 +02:00
Christian Schwarz
73336962a8 finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes 2025-08-05 19:24:06 +02:00
Christian Schwarz
fc7267a760 feature-gate compute side code 2025-08-05 19:22:58 +02:00
Krzysztof Szafrański
f3ee6e818d [proxy] Correctly classify ConnectErrors (#12793)
As is, e.g. quota errors on wake compute are logged as "compute" errors.
2025-07-31 09:53:48 +00:00
Dmitrii Kovalkov
edd60730c8 safekeeper: use last_log_term in mconf switch + choose most advanced sk in pull timeline (#12778)
## Problem
I discovered two bugs related to safekeeper migration, which
together might lead to data loss during the migration. The second bug
is from a hadron patch and might also lead to data loss during the
safekeeper restore in hadron.

1. `switch_membership` returns the current `term` instead of
`last_log_term`. It is used to choose the `sync_position` in the
algorithm, so we might choose the wrong one and break the correctness
guarantees.
2. The current `term` is used to choose the most advanced SK in
`pull_timeline` with higher priority than `flush_lsn`. It is incorrect
because the most advanced safekeeper is the one with the highest
`(last_log_term, flush_lsn)` pair. The compute might bump term on the
least advanced sk, making it the best choice to pull from, and thus
making committed log entries "uncommitted" after `pull_timeline`.

Part of https://databricks.atlassian.net/browse/LKB-1017

## Summary of changes
- Return `last_log_term` in `switch_membership`
- Use `(last_log_term, flush_lsn)` as a primary key for choosing the
most advanced sk in `pull_timeline` and deny pulling if the `max_term`
is higher than on the most advanced sk (hadron only)
- Write tests for both cases
- Retry `sync_safekeepers` in `compute_ctl`
- Take the quorum size into account when calculating `sync_position`
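
The selection rule described above, picking the safekeeper with the highest `(last_log_term, flush_lsn)` pair rather than the highest current `term`, can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `SkStatus` struct, its field types, and the function names are hypothetical.

```rust
/// Hypothetical, simplified view of one safekeeper's timeline status.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct SkStatus {
    id: u32,
    /// Current term; it can be bumped without any new log entries.
    term: u64,
    /// Term of the last entry actually written to the log.
    last_log_term: u64,
    /// Position up to which WAL has been flushed.
    flush_lsn: u64,
}

/// Pick the safekeeper with the highest `(last_log_term, flush_lsn)` pair.
/// Tuples compare lexicographically, so `flush_lsn` only breaks ties
/// between safekeepers with the same `last_log_term`.
#[allow(dead_code)]
fn most_advanced(sks: &[SkStatus]) -> Option<&SkStatus> {
    sks.iter().max_by_key(|sk| (sk.last_log_term, sk.flush_lsn))
}

/// Example: sk 2 has the highest current term but the least advanced log,
/// so it must not be chosen.
#[allow(dead_code)]
fn example_pick() -> Option<u32> {
    let sks = [
        SkStatus { id: 1, term: 5, last_log_term: 5, flush_lsn: 300 },
        SkStatus { id: 2, term: 7, last_log_term: 4, flush_lsn: 100 },
        SkStatus { id: 3, term: 5, last_log_term: 5, flush_lsn: 250 },
    ];
    most_advanced(&sks).map(|sk| sk.id)
}
```

In this sketch, ranking by `term` first would select sk 2, whose log is the shortest; ranking by the pair selects sk 1.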
2025-07-31 09:29:25 +00:00
Aleksandr Sarantsev
975b95f4cd Introduce deletion API improvement RFC (#12484)
## Problem

The deletion logic had become difficult to understand and maintain.

## Summary of changes

- Added an RFC detailing proposed improvements to all deletion-related
APIs.

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-31 08:34:47 +00:00
Mikhail
01c39f378e prewarm cancellation (#12785)
Add DELETE /lfc/prewarm route which handles ongoing prewarm
cancellation, update API spec, add prewarm Cancelled state
Add offload Cancelled state when LFC is not initialized
2025-07-30 22:05:51 +00:00
Dimitri Fontaine
4d3b28bd2e [Hadron] Always run databricks auth hook. (#12683) 2025-07-30 21:34:30 +00:00
Heikki Linnakangas
81ddd10be6 tests: Don't print Hostname on every test connection (#12782)
These lines are a significant fraction of the total log size of the
regression tests. They also seem very uninteresting: the hostname is
always 'localhost' in local tests.
2025-07-30 19:56:22 +00:00
Suhas Thalanki
e470997627 enable tests introduced in hadron commits (#12790)
Enables skipped tests introduced in hadron integration commits
2025-07-30 19:10:33 +00:00
Erik Grinaker
eb2741758b storcon: actually update gRPC address on reattach (#12784)
## Problem

In #12268, we added Pageserver gRPC addresses to the storage controller.
However, we didn't actually persist these in the database.

## Summary of changes

Update the database with the new gRPC address on reattach.
2025-07-30 16:18:35 +00:00
Matthias van de Meent
f3a0e4f255 Improve specificity with which we apply compute specs (#12773)
This makes sure we don't confuse user-controlled functions with PG's
builtin functions.

## Problem

See https://github.com/neondatabase/cloud/issues/31628
2025-07-30 15:29:16 +00:00
Suhas Thalanki
842a5091d5 [BRC-3051] Walproposer: Safekeeper quorum health metrics (#930) (#12750)
Today we don't have any indications (other than spammy logs in PG that
nobody monitors) if the Walproposer in PG cannot connect to/get votes
from all Safekeepers. This means we don't have signals indicating that
the Safekeepers are operating at degraded redundancy. We need these
signals.

Added plumbing in PG extension so that the `neon_perf_counters` view
exports the following gauge metrics on safekeeper health:
- `num_configured_safekeepers`: The total number of safekeepers
configured in PG.
- `num_active_safekeepers`: The number of safekeepers that PG is
actively streaming WAL to.

An alert should be raised whenever `num_active_safekeepers` <
`num_configured_safekeepers`.

The metrics are implemented by adding additional state to the
Walproposer shared memory keeping track of the active statuses of
safekeepers using a simple array. The status of the safekeeper is set to
active (1) after the Walproposer acquires a quorum and starts streaming
data to the safekeeper, and is set to inactive (0) when the connection
with a safekeeper is shut down. We scan the safekeeper status array in
Walproposer shared memory when collecting the metrics to produce results
for the gauges.
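
As a rough illustration of the bookkeeping described above, the status array and the two gauges could look like the sketch below. The actual change lives in the PG extension and its shared memory; this Rust type, its methods, and the `degraded` helper are hypothetical stand-ins, not the implementation.

```rust
/// Hypothetical stand-in for the per-safekeeper status array:
/// 1 = active (streaming), 0 = inactive (connection shut down).
#[allow(dead_code)]
struct SafekeeperStatuses {
    active: Vec<u8>,
}

#[allow(dead_code)]
impl SafekeeperStatuses {
    /// All safekeepers start out inactive.
    fn new(num_configured: usize) -> Self {
        Self { active: vec![0; num_configured] }
    }

    /// Record that streaming to safekeeper `idx` started or stopped.
    fn set_active(&mut self, idx: usize, is_active: bool) {
        self.active[idx] = u8::from(is_active);
    }

    /// Gauge: total number of configured safekeepers.
    fn num_configured_safekeepers(&self) -> usize {
        self.active.len()
    }

    /// Gauge: scan the array and count the active entries.
    fn num_active_safekeepers(&self) -> usize {
        self.active.iter().filter(|&&s| s == 1).count()
    }

    /// Alert condition from the description: fewer active than configured.
    fn degraded(&self) -> bool {
        self.num_active_safekeepers() < self.num_configured_safekeepers()
    }
}
```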

Added coverage for the metrics to integration test
`test_wal_acceptor.py::test_timeline_disk_usage_limit`.

---------

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-30 15:14:59 +00:00
Suhas Thalanki
056056bef0 fix(compute): validate prewarm_local_cache() input (#12648)
## Problem
```
postgres=> select neon.prewarm_local_cache('\xfcfcfcfc01000000ffffffff070000000000000000000000000000000000000000000000000000000000000000000000000000ff', 1);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
FATAL:  server conn crashed?
```

The function takes a bytea argument and casts it to a C struct, without
validating the contents.

## Summary of changes

Added validation for the number of pages to be prefetched and for the
chunks as well.
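
The general idea, checking an untrusted byte buffer before interpreting it as a struct, can be sketched as below. The layout here (a magic value, a chunk count, a page count, then fixed-size chunk entries) and every constant are assumptions made up for illustration; they are not the real prewarm state format or the checks added in this PR.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    TooShort,
    BadMagic,
    TooManyChunks,
    LengthMismatch,
    TooManyPages,
}

// All of these values are hypothetical.
const HEADER_LEN: usize = 12;
const MAGIC: u32 = 0x5052_4557;
const MAX_CHUNKS: u32 = 1024;
const CHUNK_ENTRY_LEN: usize = 8;
const PAGES_PER_CHUNK: u32 = 128;

#[derive(Debug, PartialEq, Eq)]
struct Header {
    n_chunks: u32,
    n_pages: u32,
}

/// Read a little-endian u32; the caller guarantees `off + 4 <= buf.len()`.
fn read_u32_le(buf: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Validate the buffer before any field is trusted.
#[allow(dead_code)]
fn validate(buf: &[u8]) -> Result<Header, ValidationError> {
    if buf.len() < HEADER_LEN {
        return Err(ValidationError::TooShort);
    }
    if read_u32_le(buf, 0) != MAGIC {
        return Err(ValidationError::BadMagic);
    }
    let n_chunks = read_u32_le(buf, 4);
    let n_pages = read_u32_le(buf, 8);
    // Bound the chunk count first so the arithmetic below cannot overflow.
    if n_chunks > MAX_CHUNKS {
        return Err(ValidationError::TooManyChunks);
    }
    // The declared chunk count must match the bytes actually supplied.
    if buf.len() != HEADER_LEN + n_chunks as usize * CHUNK_ENTRY_LEN {
        return Err(ValidationError::LengthMismatch);
    }
    // The page count cannot exceed what the chunks can hold.
    if n_pages > n_chunks * PAGES_PER_CHUNK {
        return Err(ValidationError::TooManyPages);
    }
    Ok(Header { n_chunks, n_pages })
}
```

Rejecting the input with an error, instead of casting it blindly, turns a malformed argument into an ordinary query failure rather than a backend crash.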
2025-07-30 14:33:19 +00:00