Commit Graph

7356 Commits

Author SHA1 Message Date
a-masterov
7607686f25 Make test extensions upgrade work with absent images (#11036)
## Problem
CI does not pass for the compute release due to the absence of some
images

## Summary of changes
Now we use the images from the old non-compute releases for non-compute
images
2025-02-28 11:16:22 +00:00
Vlad Lazar
0d6d58bd3e pageserver: make heatmap layer download API more cplane friendly (#10957)
## Problem

We intend for cplane to use the heatmap layer download API to warm up
timelines after unarchival. It's tricky for them to recurse in the
ancestors,
and the current implementation doesn't work well when unarchiving a
chain
of branches and warming them up.

## Summary of changes

* Add a `recurse` flag to the API. When the flag is set, the operation
recurses into the parent
timeline after the current one is done.
* Be resilient to warming up a chain of unarchived branches. Let's say
we unarchived `B` and `C` from
the `A -> B -> C` branch hierarchy. `B` got unarchived first. We
generated the unarchival heatmaps
and stash them in `A` and `B`. When `C` unarchived, it dropped it's
unarchival heatmap since `A` and `B`
already had one. If `C` needed layers from `A` and `B`, it was out of
luck. Now, when choosing whether
to keep an unarchival heatmap we look at its end LSN. If it's more
inclusive than what we currently have,
keep it.
2025-02-28 10:36:53 +00:00
John Spray
55633ebe3a storcon: enable API passthrough to nonzero shards (#11026)
## Problem

Storage controller will proxy GETs to pageserver-like tenant/timeline
paths through to the pageserver.

Usually GET passthroughs make sense to go to shard 0, e.g. if you want
to list timelines.

But sometimes you really want to know about a particular shard, e.g.
reading its cache state or similar.

## Summary of changes

- Accept shard IDs as well as tenant IDs in the passthrough route
- Refactor node lookup to take a shard ID and make the tenant ID case a
layer on top of that. This is one more lock take-drop during these
requests, but it's not particularly expensive and these requests
shouldn't be terribly frequent

This is not immediately used by anything, but will be there any time we
want to e.g. do a pass-through query to check the warmth of a tenant
cache on a particular shard or somesuch.
2025-02-28 08:42:08 +00:00
Heikki Linnakangas
a4b2009800 compute_ctl: Refactor, moving spec_apply functions to spec_apply.rs (#11006)
Seems nice to have the function and all its subroutines in the same
source file.
2025-02-27 20:13:06 +00:00
Alex Chi Z.
ab1f22b7d1 fix(pageserver): correctly access layer map in gc-compaction (#11021)
## Problem

layer_map access was unwrapped. It might return an error during
shutdown.

## Summary of changes

Propagate the layer_map access error back to the compaction loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-27 16:26:55 +00:00
JC Grünhage
7ed236e17e fix(ci): push prod container images again (#11020)
## Problem
https://github.com/neondatabase/neon/pull/10841 made building compute
and neon images optional on releases that don't need them. The
`push-<component>-image-prod` jobs had transitive dependencies that were
skipped due to that, causing the images not to be pushed to production
registries.

## Summary of changes

Add `!failure() && !cancelled() &&` to the beginning of the conditions
for these jobs to ensure they run even if some of their transitive
dependencies are skipped.
2025-02-27 16:16:14 +00:00
Konstantin Knizhnik
e58f264a05 Increase inmem SMGR size for walredo process to 100 pagees (#10937)
## Problem

We see `Inmem storage overflow` in page server logs:

https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339

walked process is using inseam SMGR with storage size limited by 64
pages with warning watermark 32 (based ion the assumption that
XLR_MAX_BLOCK_ID is 32, so WAL record can not access more than 32
pages).

Actually it is not true. We can update up to 3 forks for each block
(including update of FSM and VM forks).

## Summary of changes

This PR increases inseam SMGR size for walled process to 100 pages and
print stack trace in case of overflow.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-02-27 14:31:05 +00:00
Matthias van de Meent
a283edaccf PS/Prefetch: Use a timeout for reading data from TCP (#10834)
This reduces pressure on OS TCP buffers, reducing flush times in other
systems like PageServer.

## Problem

## Summary of changes
2025-02-27 14:00:18 +00:00
a-masterov
ad37199745 Separate the upgrade tests in timelines (#10974)
## Problem
We created extensions in a single database. The tests could interfere,
i.e., discover some service tables left by other extensions and produce
unexpected results.
## Summary of changes
The tests are now run in a separate timeline, so only one extension owns
the database, which prevents interference.
2025-02-27 13:45:18 +00:00
Erik Grinaker
93b59e65a2 pageserver: remove stale comment (#11016)
No longer true now that we eagerly notify the compaction loop.
2025-02-27 12:56:28 +00:00
Christian Schwarz
e35f7758d8 impr(controller_upcall_client): clean up copy-pasta code & add context to retries (#10991)
Before this PR, re-attach and validate would log the same warning
```
calling control plane generation validation API failed
```
on retry errors.

This can be confusing.

This PR makes the message generically valid for any upcall and adds
additional tracing spans to capture context.

Along the way, clean up some copy-pasta variable naming.

refs
-
https://github.com/neondatabase/neon/issues/10381#issuecomment-2684755827

---------

Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
2025-02-27 10:59:43 +00:00
Peter Bendel
3a3d62dc4f Bodobolero/test cum stats persistence (#10995)
## Problem

So far cumulative statistics have not been persisted when Neon scales to
zero (suspends endpoint).
With PR https://github.com/neondatabase/neon/pull/6560 the cumulative
statistics should now survive endpoint restarts and correctly trigger
the auto- vacuum and auto analyze maintenance

So far we did not have a testcase that validates that improvement in our
dev cloud environment with a real project.

## Summary of changes

Introduce testcase `test_cumulative_statistics_persistence`in the
benchmarking workflow running daily to verify:

- Verifies that the cumulative statistics are correctly persisted across
restarts.
- Cumulative statistics are important to persist across restarts because
they are used
-  when auto-vacuum an auto-analyze trigger conditions are met.
-  The test performs the following steps:
    - Seed a new project using pgbench
    - insert tuples that by itself are not enough to trigger auto-vacuum
    - suspend the endpoint
    - resume the endpoint
- insert additional tuples that by itself are not enough to trigger
auto-vacuum but in combination with the previous tuples are
- verify that autovacuum is triggered by the combination of tuples
inserted before and after endpoint suspension

## Test run


https://github.com/neondatabase/neon/actions/runs/13546879714/job/37860609089#step:6:282
2025-02-27 10:45:13 +00:00
Arpad Müller
a22be5af72 Migrate the last crates to edition 2024 (#10998)
Migrates the remaining crates to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.

Like the previous migration PRs, this is comprised of three commits:

* the first does the edition update and makes `cargo check`/`cargo
clippy` pass. we had to update bindgen to make its output [satisfy the
requirements of edition
2024](https://doc.rust-lang.org/edition-guide/rust-2024/unsafe-extern.html)
* the second commit does a `cargo fmt` for the new style edition.
* the third commit reorders imports as a one-off change. As before, it
is entirely optional.

Part of #10918
2025-02-27 09:40:40 +00:00
Christian Schwarz
f09843ef17 refactor(pageserver): propagate RequestContext to layer downloads (#11001)
For some reason the layer download API never fully got
`RequestContext`-infected.

This PR fixes that as a precursor to
- https://github.com/neondatabase/neon/issues/6107
2025-02-27 09:26:25 +00:00
JC Grünhage
c92a36740b fix(ci): support PR-on-top-of-PR usecase again (#11013)
## Problem
https://github.com/neondatabase/neon/pull/10841 broke CI on PRs that
aren't based on main or a release branch but want to merge into another
PR.

## Summary of changes
Replace `run-kind=pr-main` with `run-kind=pr`, so that all PRs that
aren't release PRs are treated equally.
2025-02-27 09:05:15 +00:00
Arseny Sher
8b86cd1154 safekeeper: follow membership configuration rules (#10781)
## Problem

safekeepers must ignore walproposer messages with non matching
membership conf.

## Summary of changes

Make safekeepers reject vote request, proposer elected and append
request messages with non matching generation. Switch to the
configuration in the greeting message if it is higher.

In passing, fix one comment and WAL truncation.

Last part of https://github.com/neondatabase/neon/issues/9965
2025-02-27 06:13:30 +00:00
Heikki Linnakangas
c50b38ab72 compute_ctl: Fix comment on start_postgres (#11005)
The comment was woefully outdated and outright wrong. It applied a long
time ago (before commit e5cc2f92c4 to be precise), but nowadays the
function just launches postgres and waits until it starts accepting
connections. The other things the comment talked about are done in other
functions.
2025-02-26 23:38:45 +00:00
Fedor Dikarev
4f4a3910d0 fix error (Line: 74, Col: 26): Unexpected value 'false' (#10999)
## Problem
Check neon with extra platform builds is failing on main with:
```
The template is not valid. .github/workflows/neon_extra_builds.yml (Line: 74, Col: 26): Unexpected value 'false'
```
https://github.com/neondatabase/neon/actions/runs/13549634905

## Summary of changes
Use `fromJson()` to have `false` as boolean value.

thanks to @skyzh for pointing on the issue
2025-02-26 19:54:46 +00:00
Alex Chi Z.
11aab9f0de fix(pageserver): further stablize gc-compaction tests (#10975)
## Problem

Yet another source of flakyness for
https://github.com/neondatabase/neon/issues/10517

## Summary of changes

The test scenario we want to create is that we have an image layer in
index_part and then overwrite it, so we have to ensure it gets persisted
in index_part by doing a force checkpoint.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-26 19:50:10 +00:00
Heikki Linnakangas
5cfdb1244f compute_ctl: Add OTEL tracing to incoming HTTP requests and startup (#10971)
We lost this with the switch to axum for the HTTP server. Add it back.

In addition to just resurrecting the functionality we had before, pass
the tracing context of the /configure HTTP request to the start_postgres
operation that runs in the main thread. This way, the 'start_postgres'
and all its sub-spans like getting the basebackup become children of the
HTTP request span. This allows end-to-end tracing of a compute start,
all the way from the proxy to the SQL queries executed by compute_ctl as
part of compute startup.
2025-02-26 19:27:16 +00:00
Arseny Sher
643a48210f safekeeper: exclude API (#10757)
## Problem

https://github.com/neondatabase/neon/pull/10241 added configuration
switch endpoint, but it didn't delete timeline if node was excluded.

## Summary of changes

Add separate /exclude API endpoint which similarly accepts membership
configuration where sk is supposed by be excluded. Implementation
deletes the timeline locally.

Some more small related tweaks:
- make mconf switch API PUT instead of POST as it is idempotent;
- return 409 if switch was refused instead of 200 with requested &
current;
- remove unused was_active flag from delete response;
- remove meaningless _force suffix from delete functions names;
- reuse timeline.rs delete_dir function in timelines_global_map instead
of its own copy.

part of https://github.com/neondatabase/neon/issues/9965
2025-02-26 19:26:33 +00:00
Arseny Sher
c1a040447d walproposer: send valid timeline_start_lsn in v2 (#10994)
## Problem

https://github.com/neondatabase/neon/pull/10647 dropped
timeline_start_lsn from protocol messages as it can be taken from term
history. In v2 0 was sent in the placeholder. However, until safekeepers
are deployed with that PR they still use the value, setting
timeline_start_lsn to 0, which confuses WAL reading; problem appears
only when compute includes 10647 but safekeepers don't.

ref
https://neondb.slack.com/archives/C04DGM6SMTM/p1740577649644269?thread_ts=1740572363.541619&cid=C04DGM6SMTM

## Summary of changes

Send real value instead of 0 in v2.
2025-02-26 17:38:44 +00:00
Alex Chi Z.
30f3be9840 fix(test): reduce number of relations in test_tx_abort_with_many_relations (#10997)
## Problem

I see a lot of timeout errors, which indicates that this test is too
slow. It seems that create relations are fast, but the subsequent
truncating step is slow.

## Summary of changes

Reduce number of relations for now, and investigate later.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-26 17:19:14 +00:00
JC Grünhage
8dfa8f0b94 feat(ci): don't build storage on compute-releases and vice versa (#10841)
## Problem
Release CI is slow, because we're doing unnecessary work, for example
building compute images on storage releases and vice versa.

## Summary of changes
- Extract tag generation into reusable workflow and extend it with
fetching of previous component releases
- Don't build neon images on compute releases and don't build compute
images on proxy and storage releases
- Reuse images from previous releases for tests on branches where we
don't build those images

## Open questions
- We differentiate between `TAG` and `COMPUTE_TAG` in a few places, but
we don't differentiate between storage and proxy releases. Since they
use the same image, this will continue to work, but I'm not sure this is
what we want.
2025-02-26 17:17:26 +00:00
Alex Chi Z.
a138a6de9b fix(pageserver): correctly handle collect_keyspace errors (#10976)
## Problem

ref https://github.com/neondatabase/neon/issues/10927

## Summary of changes

* Implement `is_critical` and `is_cancel` over `CompactionError`.
* Revisit all places that uses `CollectKeyspaceError` to ensure they are
handled correctly.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-26 17:09:50 +00:00
Arpad Müller
14347630a4 ancestor detach: delete hardlinked layers on error (#10977)
Delete layers that we have hardlinked so far when there is an error in
`remote_copy`. This prevents a retry of the ancestor detach from
stumbling over already present layer files: the hardlink would fail with
an error.

If there is a crash, we already clean up during the timeline attach: we
loop over all layer files and purge all layers that are not referenced
by the `index_part.json`.

Make sure to hold the timeline gate to prevent races with
detach&attach&read from the layer file.

These cleanups aren't completely enough however, as there is code after
`prepare` as well. To handle errors there, we add a special case for
`AlreadyExists` errors during the hardlink, where we check if the layer
is an orphan, and if yes, we delete it from local disk. That is ideally
not the case we hit, as it is less clear in that scenario where the
layer came from, but it provides good defense in depth.

Related #10729
Fixes #10970
2025-02-26 16:11:15 +00:00
Erik Grinaker
86b9703f06 pageserver: set SO_KEEPALIVE on the page service socket (#10992)
## Problem

If the client connection goes dead without an explicit close (e.g. due
to network infrastructure dropping the connection) then we currently
won't detect it for a long time, which may e.g. block GetPage flushes
and keep the task running.

Touches https://github.com/neondatabase/cloud/issues/23515.

## Summary of changes

Enable `SO_KEEPALIVE` on the page service socket, to enable periodic TCP
keepalive probes. These are configured via Linux sysctls, which will be
deployed separately. By default, the first probe is sent after 2 hours,
so this doesn't have a practical effect until we change the sysctls.
2025-02-26 14:36:05 +00:00
Arseny Sher
01581f3af5 safekeeper: drop json_ctrl (#10722)
## Problem

json_ctrl.rs is an obsolete attempt to have tests with fine control of
feeding messages into safekeeper superseded by desim framework.

## Summary of changes

Drop it.
2025-02-26 13:32:37 +00:00
Arpad Müller
f94286f0c9 Upgrade compute_tools and compute_api to edition 2024 (#10983)
Updates `compute_tools` and `compute_api` crates to edition 2024. We
like to stay on the latest edition if possible. There is no functional
changes, however some code changes had to be done to accommodate the
edition's breaking changes.

The PR has three commits:

* the first commit updates the named crates to edition 2024 and appeases
`cargo clippy` by changing code.
* the second commit performs a `cargo fmt` that does some minor changes
(not many)
* the third commit performs a cargo fmt with nightly options to reorder
imports as a one-time thing. it's completely optional, but I offer it
here for the compute team to review it.

I'd like to hear opinions about the third commit, if it's wanted and
felt worth the diff or not. I think most attention should be put onto
the first commit.

Part of #10918
2025-02-26 13:12:26 +00:00
Fedor Dikarev
c2a768086d add credentials for pulling containers for the jobs (#10987)
Ref: https://github.com/neondatabase/cloud/issues/24939

## Problem
I found that we are missing authorization for some container jobs, that
will make them use anonymous pulls. It's not an issue for now, with high
enough limits, but that could be an issue when new limits introduced in
DockerHub (10 pulls / hour)

## Summary of changes
- add credentials for the jobs that run in containers
2025-02-26 12:50:06 +00:00
Vlad Lazar
622a9def6f tests: use generated record lsn instead of hardcoded one (#10990)
... and start the initial reader with the correct lsn

Closes https://github.com/neondatabase/neon/issues/10978
2025-02-26 12:47:13 +00:00
Arpad Müller
26bda17551 storcon: use the SchedulingPolicy enum in SafekeeperPersistence (#10897)
We don't want to serialize to/from string all the time, so use
`SchedulingPolicy` in `SafekeeperPersistence` via the use of a wrapper.

Stacked atop #10891
2025-02-26 12:12:50 +00:00
Folke Behrens
0d36f52a6c proxy: Record and export user-agent header (#10955)
neondatabase/cloud#24464
2025-02-26 11:39:34 +00:00
Heikki Linnakangas
40ad42d556 Silence "sudo: unable to resolve host" messages at compute startup (#10985) 2025-02-26 10:10:05 +00:00
Heikki Linnakangas
e452f2a5a3 Remove some redundant log lines at postgres startup (#10958) 2025-02-26 10:06:42 +00:00
Heikki Linnakangas
43b109af69 compute_ctl: Add more detailed tracing spans to startup subroutines (#10979)
In local dev environment, these steps take around 100 ms, and they are
in the critical path of a compute startup on a compute pool hit. I don't
know if it's like that in production, but as first step, add tracing
spans to the functions so that they can be measured more easily.
2025-02-26 09:51:07 +00:00
Arthur Petukhovsky
3684162d9f Bump vm-builder v0.37.1 -> v0.42.2 (#10981)
Bump version to pick up changes introduced in
https://github.com/neondatabase/autoscaling/pull/1286

It's better to have a compute release for this change first, because:
- vm-runner changes kernel loglevel from 7 to 6
- vm-builder has a change to bring it back to 7 after startup

Previous update: https://github.com/neondatabase/neon/pull/10015
2025-02-26 09:19:19 +00:00
Arpad Müller
920040e402 Update storage components to edition 2024 (#10919)
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.

The PR has two commits:

* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`

I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).

part of https://github.com/neondatabase/neon/issues/10918
2025-02-25 23:51:37 +00:00
Konstantin Knizhnik
dc975d554a Incremenet getpage histogram in prefetch_lookup (#10965)
## Problem

PR https://github.com/neondatabase/neon/pull/10442 added prefetch_lookup
function.
It changed handling of getpage requests in compute.

Before:
1. Lookup in LFC (return if found)
2. Register prefetch buffer
3. Wait prefetch result (increment getpage_hist)
Now:

1. Lookup prefetch ring (return if prefetch request is already
completed)
2. Lookup in LFC (return if found)
3. Register prefetch buffer
4. Wait prefetch result (increment getpage_hist)

So if prefetch result is already available, then get page histogram is
not incremented.
It case failure of some test_throughtput benchmarks:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499

## Summary of changes

Increment getpage histogram in `prefetch_lookup`

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-02-25 19:51:38 +00:00
Suhas Thalanki
d05606252d fix: only showing LSN for static computes in neon endpoint list (#10931)
## Problem

`neon endpoint list` shows a different LSN than what the state of the
replica is. This is mainly down to what we define as LSN in this output.
If we define it as the LSN that a compute was started with, it only
makes sense to show it for static computes.

## Summary of changes

Removed the output of `last_record_lsn` for primary/hot standby
computes.

Closes: https://github.com/neondatabase/neon/issues/5825

---------

Co-authored-by: Tristan Partin <tristan@neon.tech>
2025-02-25 19:26:14 +00:00
Alex Chi Z.
c69ebb4486 fix(ci): extend timeout to 75min (#10963)
60min is not enough for debug builds

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-25 17:37:23 +00:00
a-masterov
1fb2faab5b Rename the patch files for the semver test (#10966)
## Problem
The patch for `semver` extensions relies on `PG_VERSION` environment
variable. The files were named without the letter `v` so script cannot
find them.
## Summary of changes
The patch files were renamed.
2025-02-25 16:00:43 +00:00
Alex Chi Z.
015092d259 feat(pageserver): add automatic trigger for gc-compaction (#10798)
## Problem

part of https://github.com/neondatabase/neon/issues/9114

## Summary of changes

Add the auto trigger for gc-compaction. It computes two values: L1 size
and L2 size. When L1 size >= initial trigger threshold, we will trigger
an initial gc-compaction. When l1_size / l2_size >=
gc_compaction_ratio_percent, we will trigger the "tiered" gc-compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-25 14:50:39 +00:00
Alex Chi Z.
b7fcf2c7a7 test(pageserver): add reldir v2 into tests (#10750)
## Problem

We have `test_perf_many_relations` but it only runs on remote clusters,
and we cannot directly modify tenant config. Therefore, I patched one of
the current tests to benchmark relv2 performance.

close https://github.com/neondatabase/neon/issues/9986

## Summary of changes

* Add `v1/v2` selector to `test_tx_abort_with_many_relations`.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-02-25 14:50:22 +00:00
Erik Grinaker
8deeddd4f0 pageserver: ignore CollectKeySpaceError::Cancelled during compaction (#10968)
This pops up a few times during deployment. Not sure why it fires
without `self.cancel` being cancelled, but could be e.g. ancestor
timelines or sth.
2025-02-25 14:49:41 +00:00
a-masterov
f78ac44748 Use the Dockerfile COPY instead of docker cp (#10943)
## Problem
We use `docker cp` to copy the files required for the extension tests
now.
It causes problems if we run older images with the newer source tree.
## Summary of changes
Copying the files was moved to the compute Dockerfile.
2025-02-25 12:44:06 +00:00
Folke Behrens
f4fefd9f2f pre-commit: Switch to cargo fmt to handle per-crate editions (#10969)
cargo knows what edition each crate uses.
2025-02-25 12:29:27 +00:00
Konstantin Knizhnik
8f82c661d4 Move neon_pgstat_file_size_limit to the extension (#10959)
## Problem

PG14 uses separate backend for stats collector having no access to
shaerd memory.
As far as AUX mechanism requires access to shared memory, persisting
pgstat.stat file
is not supported at pg14. And so there is no definition of
`neon_pgstat_file_size_limit`
variable. It makes it impossible to provide same config for all Postgres
version.

## Summary of changes

Move neon_pgstat_file_size_limit to Neon extension.

Postgres submodules PR:
https://github.com/neondatabase/postgres/pull/587
https://github.com/neondatabase/postgres/pull/588
https://github.com/neondatabase/postgres/pull/589

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Tristan Partin <tristan@neon.tech>
2025-02-25 12:23:04 +00:00
Arseny Sher
758f597280 compute <-> sk protocol v3 (#10647)
## Problem

As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.

## Summary of changes

Add version 3 of the protocol carrying membership configurations.
Greeting message in both sides gets full conf, and other messages
generation number only. Use protocol bump to include other accumulated
changes:
- stop packing whole structs on the wire as is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't
much useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is sent now as hex cstring.
Remove proto version, it is passed outside in START_WAL_PUSH. Remove
postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from
first member of term history, and later we won't need it at all when all
timelines will be explicitly created. Vote itself is u8 instead of u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn removed, it is known from term history
in ProposerElected.

Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; neon.safekeeper_proto_version GUC sets the
client version. v2 code can be dropped later.

So far empty conf is passed everywhere, future PRs will handle them.

To test, add param to some tests choosing proto version; we want to test
both 2 and 3 until we fully migrate.

ref https://github.com/neondatabase/neon/issues/10326

---------

Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
2025-02-25 11:56:05 +00:00
Vlad Lazar
0d9a45a475 safekeeper: invalidate start of interpreted batch on reader resets (#10951)
## Problem

The interpreted WAL reader tracks the start of the current logical
batch.
This needs to be invalidated when the reader is reset.

This bug caused a couple of WAL gap alerts in staging.

## Summary of changes

* Refactor to make it possible to write a reproducer
* Add repro unit test
* Fix by resetting the start with the reader

Related https://github.com/neondatabase/cloud/issues/23935
2025-02-25 10:21:35 +00:00