Commit Graph

8419 Commits

Author SHA1 Message Date
Christian Schwarz
be1c1df6aa Merge commit '84a2556c9' into problame/standby-horizon-leases 2025-08-06 17:58:54 +02:00
Christian Schwarz
7d28fb118b Merge commit 'f85935446' into problame/standby-horizon-leases 2025-08-06 17:58:36 +02:00
Christian Schwarz
daf2b5a806 Merge commit 'b00a0096b' into problame/standby-horizon-leases 2025-08-06 17:56:37 +02:00
Christian Schwarz
e52d0ef311 Merge commit '5b0972151' into problame/standby-horizon-leases 2025-08-06 17:56:07 +02:00
Christian Schwarz
d22e23f66d Merge commit '108f7ec54' into problame/standby-horizon-leases 2025-08-06 17:55:56 +02:00
Christian Schwarz
54480167dc Merge commit '9c0efba91' into problame/standby-horizon-leases 2025-08-06 17:55:48 +02:00
Christian Schwarz
30e7c4b75d Merge commit '187170be4' into problame/standby-horizon-leases 2025-08-06 17:55:39 +02:00
Christian Schwarz
d380111428 Merge commit '87915df2f' into problame/standby-horizon-leases 2025-08-06 17:55:06 +02:00
Christian Schwarz
78a8ac7be9 ruff format 2025-08-06 17:54:36 +02:00
Christian Schwarz
279865c68a Merge commit 'dd7fff655' into problame/standby-horizon-leases 2025-08-06 17:54:17 +02:00
Christian Schwarz
1ace4bcf23 Merge commit '809633903' into problame/standby-horizon-leases 2025-08-06 17:50:43 +02:00
Christian Schwarz
35c916c062 Merge commit '5c934efb2' into problame/standby-horizon-leases 2025-08-06 17:50:33 +02:00
Christian Schwarz
02e1aeef66 Merge commit 'a456e818a' into problame/standby-horizon-leases 2025-08-06 17:49:56 +02:00
Christian Schwarz
e2c88c1929 Merge commit '296c9190b' into problame/standby-horizon-leases 2025-08-06 17:49:50 +02:00
Christian Schwarz
553a120075 Merge commit '15f633922' into problame/standby-horizon-leases 2025-08-06 17:49:41 +02:00
Christian Schwarz
cfe345d3e6 Merge commit 'c34d36d8a' into problame/standby-horizon-leases 2025-08-06 17:47:29 +02:00
Christian Schwarz
e2facbde4e Merge commit 'cec0543b5' into problame/standby-horizon-leases 2025-08-06 17:47:10 +02:00
Christian Schwarz
b8c8168378 Merge commit 'be5bbaeca' into problame/standby-horizon-leases 2025-08-06 17:46:44 +02:00
Christian Schwarz
28a2cd05d5 Merge commit '5ec82105c' into problame/standby-horizon-leases 2025-08-06 17:46:37 +02:00
Christian Schwarz
1635390a96 fix all clippy complaints in this branch 2025-08-06 17:39:17 +02:00
Christian Schwarz
1877b70a35 Merge commit 'e7d18bc18' into problame/standby-horizon-leases 2025-08-06 17:19:37 +02:00
Christian Schwarz
fb7a027211 Merge commit '4ee0da0a2' into problame/standby-horizon-leases 2025-08-06 17:17:45 +02:00
Christian Schwarz
47146fe1d6 Merge commit '7049003cf' into problame/standby-horizon-leases 2025-08-06 17:17:11 +02:00
Christian Schwarz
577eee16f9 https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343; concern about backward compat of TimelineInfo 2025-08-05 23:07:26 +02:00
Christian Schwarz
2ee0f4271c fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field
The LSN lease api actually accepts a tenant_shard_id, not a tenant_id.
But we put the Display of the tenant_shard_id into the tenant_id field.
This PR fixes it.

Refs
- fixes https://databricks.atlassian.net/browse/LKB-2930
2025-08-05 22:48:27 +02:00
Christian Schwarz
8a9f1dd5e7 use tokio::time::Instant internally, chrono::DateTime<Utc> externally; commuicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism 2025-08-05 22:47:53 +02:00
Christian Schwarz
9f01840c18 use standby_horizon leases feature in the test, demonstrating that it passes now 2025-08-05 22:47:28 +02:00
Christian Schwarz
44466cebdb WIP better observability for return values (SystemTime Debug is useless) 2025-08-05 22:46:54 +02:00
Christian Schwarz
b865e85de3 previous commit broke the tests because of the cfg business, see this commit's TODO 2025-08-05 22:46:24 +02:00
Christian Schwarz
73336962a8 finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes 2025-08-05 19:24:06 +02:00
Christian Schwarz
fc7267a760 feature-gate compute side code 2025-08-05 19:22:58 +02:00
Christian Schwarz
3365c8c648 enforce standby_horizon leases are always above applied_gc_cutoff (check against cutoff on upsert + block gc for lease length to allow renewals after attach) 2025-07-26 16:38:44 +02:00
Christian Schwarz
bc09df8823 add todo about init deadline 2025-07-26 16:23:59 +02:00
Christian Schwarz
e1eb98c0e9 add basic test & fix embarrasing bug in cull (needs comment out todo!()) 2025-07-26 16:23:59 +02:00
Christian Schwarz
1e61ac6af2 cargo fmt (unrelated to prev commit) 2025-07-26 16:23:59 +02:00
Christian Schwarz
a948054db3 naming orhtodoxy: always refere to leases as LSN leases 2025-07-26 16:23:59 +02:00
Christian Schwarz
2ee24900ca have claude generate plumbing for standby_horizon_lease_length 2025-07-25 13:16:20 +02:00
Christian Schwarz
23d1029afd explain why there's no need to check standby_horizon lease deadline for getpage requests 2025-07-25 09:30:27 +00:00
Alexander Bayandin
84a2556c9f compute-node.Dockerfile: update bullseye-backports backports url (#12700)
## Problem

> bullseye-backports has reached end-of-life and is no longer supported
or updated

From: https://backports.debian.org/Instructions/

This causes the compute-node image build to fail with the following
error:
```
0.099 Err:5 http://deb.debian.org/debian bullseye-backports Release
0.099   404  Not Found [IP: 146.75.122.132 80]
...
1.293 E: The repository 'http://deb.debian.org/debian bullseye-backports Release' does not have a Release file.
```

## Summary of changes
- Use archive version of `bullseye-backports`
2025-07-23 14:45:52 +00:00
Conrad Ludgate
761e9e0e1d [proxy] move read_info from the compute connection to be as late as possible (#12660)
Second attempt at #12130, now with a smaller diff.

This allows us to skip allocating for things like parameter status and
notices that we will either just forward untouched, or discard.

LKB-2494
2025-07-23 13:33:21 +00:00
Dmitrii Kovalkov
94cb9a79d9 safekeeper: generation aware timeline tombstones (#12482)
## Problem
With safekeeper migration in mind, we can now pull/exclude the timeline
multiple times within the same safekeeper. To avoid races between out of
order requests, we need to ignore the pull/exclude requests if we have
already seen a higher generation.

- Closes: https://github.com/neondatabase/neon/issues/12186
- Closes: [LKB-949](https://databricks.atlassian.net/browse/LKB-949)

## Summary of changes
- Annotate timeline tombstones in safekeeper with request generation.
- Replace `ignore_tombstone` option with `mconf` in
`PullTimelineRequest`
- Switch membership in `pull_timeline` if the existing/pulled timeline
has an older generation.
- Refuse to switch membership if the timeline is being deleted
(`is_canceled`).
- Refuse to switch membership in compute greeting request if the
safekeeper is not a member of `mconf`.
- Pass `mconf` in `PullTimelineRequest` in safekeeper_service

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-07-23 11:01:04 +00:00
Tristan Partin
fc242afcc2 PG ignore PageserverFeedback from unknown shards (#12671)
## Problem
When testing tenant splits, I found that PG can get backpressure
throttled indefinitely if the split is aborted afterwards. It turns out
that each PageServer activates new shard separately even before the
split is committed and they may start sending PageserverFeedback to PG
directly. As a result, if the split is aborted, no one resets the
pageserver feedback in PG, and thus PG will be backpressure throttled
forever unless it's restarted manually.

## Summary of changes
This PR fixes this problem by having
`walprop_pg_process_safekeeper_feedback` simply ignore all pageserver
feedback from unknown shards. The source of truth here is defined by the
shard map, which is guaranteed to be reloaded only after the split is
committed.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-22 21:41:56 +00:00
Suhas Thalanki
e275221aef add hadron-specific metrics (#12686) 2025-07-22 21:17:45 +00:00
Alex Chi Z.
f859354466 feat(pageserver): add db rel count as feature flag property (#12632)
## Problem

As part of the reldirv2 rollout: LKB-197.


We will use number of db/rels as a criteria whether to rollout reldirv2
directly on the write path (simplest and easiest way of rollout). If the
number of rel/db is small then it shouldn't take too long time on the
write path.

## Summary of changes

* Compute db/rel count during basebackup.
* Also compute it during logical size computation.
* Collect maximum number of db/rel across all timelines in the feature
flag propeties.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-22 17:55:07 +00:00
Konstantin Knizhnik
b00a0096bf Reintialize page in allocNewBuffer only when buffer is returned (#12399)
## Problem

See https://github.com/neondatabase/neon/issues/12387

`allocNewBuffer` initialise page with zeros 
but not always return it because of parity checks.
In case of wrong parity the page is rejected and as a result we have
dirty page with zero LSN, which cause assertion failure on neon_write
when page is evicted from shared buffers.

## Summary of changes

Perform, page initialisation in `allocNewBuffer` only when buffer is
returned (parity check is passed).

Postgres PRs:
https://github.com/neondatabase/postgres/pull/661
https://github.com/neondatabase/postgres/pull/662
https://github.com/neondatabase/postgres/pull/663
https://github.com/neondatabase/postgres/pull/664

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
2025-07-22 17:50:26 +00:00
a-masterov
b3844903e5 Add new operations to Random operations test (#12213)
## Problem
We did not test some Public API calls, such as using a timestamp to
create a branch, reset_to_parent.
## Summary of changes
Tests now include some other operations: reset_to_parent, a branch
creation from any time in the past, etc.
Currently, the API calls are only exposed; the semantics are not
verified.

---------

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-22 17:43:01 +00:00
Vlad Lazar
5b0972151c pageserver: silence shard resolution warning (#12685)
## Problem

We drive the get page requests that have started processing to
completion. So in the case when the compute received a reconfiguration
request and the old connection has a request procesing on the
pageserver, we are going to issue the warning.

I spot checked a few instances of the warning and in all cases the
compute was already connected to the correct pageserver.

## Summary of Changes

Downgrade to INFO. It would be nice to somehow figure out if the
connection has been terminated in the meantime, but the terminate libpq
message is still in the pipe while we're doing the shard resolution.

Closes LKB-2381
2025-07-22 17:34:23 +00:00
Heikki Linnakangas
51ffeef93f Fix postgres version compatibility macros (#12658)
The argument to BufTagInit was called 'spcOid', and it was also setting
a field called 'spcOid'. The field name would erroneously also be
expanded with the macro arg. It happened to work so far, because all the
users of the macro pass a variable called 'spcOid' for the 'spcOid'
argument, but as soon as you try to pass anything else, it fails. And
same story for 'dbOid' and 'relNumber'. Rename the arguments to avoid
the name collision.

Also while we're at it, add parens around the arguments in a few macros,
to make them safer if you pass something non-trivial as the argument.
2025-07-22 16:52:57 +00:00
Erik Grinaker
0fe07dec32 test_runner: allow stuck reconciliation errors (#12682)
This log message was added in #12589.

During chaos tests, reconciles may not succeed for some time, triggering
the log message.

Resolves [LKB-2467](https://databricks.atlassian.net/browse/LKB-2467).
2025-07-22 16:43:35 +00:00
HaoyuHuang
8de320ab9b Add a few compute_tool changes (#12677)
## Summary of changes
All changes are no-op.
2025-07-22 16:22:18 +00:00