Commit Graph

6591 Commits

Author SHA1 Message Date
Arseny Sher
ced1903267 Rework CommitEntries 2024-11-29 10:11:33 +03:00
Arseny Sher
9584317564 writing entries, remove prop null conf 2024-11-27 16:55:38 +03:00
Arseny Sher
4c7bdfa70a becomeleader 2024-11-27 14:19:43 +03:00
Arseny Sher
59cb648457 voting 2024-11-27 12:53:26 +03:00
Arseny Sher
8c880d088b RestartProposer 2024-11-26 16:52:05 +03:00
Arseny Sher
c940f196ce Start reconfig 2024-11-26 16:34:40 +03:00
Arseny Sher
d0b4b3e64a fmt 2024-11-26 16:34:26 +03:00
Arseny Sher
617a8711be fix comment 2024-11-25 18:00:48 +03:00
Arseny Sher
da5b71b5c1 a bit of readme 2024-11-25 15:59:07 +03:00
Arseny Sher
6b62b7633b address review 2024-11-25 13:19:58 +03:00
Arseny Sher
07872b310c One more small model 2024-11-18 14:06:13 +03:00
Arseny Sher
90aa12c3d8 Add elected_history 2024-11-18 14:06:13 +03:00
Arseny Sher
02dc3b2ba2 remove obsolete nextentry 2024-11-18 14:06:13 +03:00
Arseny Sher
9ab6a89b5c p2a3t4l4 run 2024-11-18 14:06:13 +03:00
Arseny Sher
79137382e7 note on fig8 2024-11-18 14:06:13 +03:00
Arseny Sher
e87d0813d1 Add some TLC results. 2024-11-18 14:06:13 +03:00
Arseny Sher
ec7c8814f4 add p2a3t4l4 model 2024-11-18 14:06:13 +03:00
Arseny Sher
31c3eb7628 fix previous 2024-11-18 14:06:13 +03:00
Arseny Sher
a2c67361b0 Add cfg to out file name 2024-11-18 14:06:13 +03:00
Arseny Sher
0d057d1374 fix previous 2024-11-18 14:06:13 +03:00
Arseny Sher
2917e49391 add tools var 2024-11-18 14:06:13 +03:00
Arseny Sher
91357b05e8 Get cpu name differently 2024-11-18 14:06:13 +03:00
Arseny Sher
765adaf16c Add even bigger model. 2024-11-18 14:06:13 +03:00
Arseny Sher
50a23d5a14 Move CommittedNotTruncated 2024-11-18 14:06:13 +03:00
Arseny Sher
1c30e6a61a add big model 2024-11-18 14:06:13 +03:00
Arseny Sher
42a9ef3645 fix newline 2024-11-18 14:06:13 +03:00
Arseny Sher
83b8e5c117 Piece of protocol readme. 2024-11-18 14:06:13 +03:00
Arseny Sher
5912932de8 remove whitespace 2024-11-18 14:06:13 +03:00
Arseny Sher
9aa29712d3 more models 2024-11-18 14:06:13 +03:00
Arseny Sher
443a6fdfdb bad quorums 2024-11-18 14:06:13 +03:00
Arseny Sher
664569ecdb MaxTruncatedTerms 2024-11-18 14:06:13 +03:00
Arseny Sher
a9ced3573a Add CommittedNotTruncated 2024-11-18 14:06:13 +03:00
Arseny Sher
f7b9fc1c81 Save runs. 2024-11-18 14:06:13 +03:00
Arseny Sher
979f925949 CommitEntries. 2024-11-18 14:06:13 +03:00
Arseny Sher
13fd695e3f Add TruncateWal 2024-11-18 14:06:13 +03:00
Arseny Sher
2d0c22d77d Start adding term history, election works. 2024-11-18 14:06:13 +03:00
Arseny Sher
9cde4ab0a7 Cleaner separation of spec and model checking,
and runner script.
2024-11-18 14:06:13 +03:00
Peter Bendel
c3eecf6763 adapt pgvector bench to minor version upgrades of PostgreSQL (#9784)
## Problem

The pgvector benchmark is failing because, after a PostgreSQL minor version
upgrade, the previous version's packages are no longer available in the deb
repository.

[example
failure](https://github.com/neondatabase/neon/actions/runs/11875503070/job/33092787149#step:4:40)

## Summary of changes

Update the PostgreSQL minor version of the packages to the current version.

[Example run after this
change](https://github.com/neondatabase/neon/actions/runs/11888978279/job/33124614605)
2024-11-18 10:47:43 +00:00
Konstantin Knizhnik
6fa9b0cd8c Use DATA_DIR instead of current working directory in restore_from_wal script (#9729)
## Problem

See https://github.com/neondatabase/neon/issues/7750

test_wal_restore.sh copies a file to the current working directory, which
can cause interference between test_wal_restore.py tests spawned for
different configurations.

## Summary of changes

Copy file to $DATA_DIR

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-11-18 11:55:38 +02:00
a-masterov
10bc1903e1 Fix the regression test running against the staging instance (#9773)
## Problem
The Postgres version was updated. The patch has to be updated
accordingly.
## Summary of changes
The patch of the regression test was updated.
2024-11-18 10:30:50 +01:00
John Spray
261d065e6f pageserver: respect no_sync in VirtualFile (#9772)
## Problem

`no_sync` initially just skipped syncfs on startup (#9677). I'm also
interested in flaky tests that time out during pageserver shutdown while
flushing l0s, so to eliminate disk throughput as a source of issues
there, `no_sync` should be respected in VirtualFile as well.

## Summary of changes

- Drive-by change for test timeouts: add a couple more ::info logs
during pageserver startup so it's obvious which part got stuck.
- Add a SyncMode enum to configure VirtualFile and respect it in
sync_all and sync_data functions
- During pageserver startup, set SyncMode according to `no_sync`
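
The idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual pageserver code: the variant names, the `from_no_sync` helper, and the `SyncAwareFile` wrapper are all hypothetical stand-ins for whatever `VirtualFile` really does.

```rust
use std::fs::File;
use std::io;

/// Hypothetical sync policy; the variant names are assumptions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SyncMode {
    /// Forward sync calls to the operating system.
    Sync,
    /// Skip durability calls entirely (only sensible in tests).
    UnsafeNoSync,
}

impl SyncMode {
    /// Map a `no_sync` startup flag to a mode (hypothetical helper).
    pub fn from_no_sync(no_sync: bool) -> Self {
        if no_sync {
            SyncMode::UnsafeNoSync
        } else {
            SyncMode::Sync
        }
    }
}

/// Simplified stand-in for a file wrapper that respects the mode.
pub struct SyncAwareFile {
    file: File,
    mode: SyncMode,
}

impl SyncAwareFile {
    pub fn new(file: File, mode: SyncMode) -> Self {
        Self { file, mode }
    }

    /// Flush data and metadata, unless syncing is disabled.
    pub fn sync_all(&self) -> io::Result<()> {
        match self.mode {
            SyncMode::Sync => self.file.sync_all(),
            SyncMode::UnsafeNoSync => Ok(()),
        }
    }

    /// Flush data only, unless syncing is disabled.
    pub fn sync_data(&self) -> io::Result<()> {
        match self.mode {
            SyncMode::Sync => self.file.sync_data(),
            SyncMode::UnsafeNoSync => Ok(()),
        }
    }
}
```

With a wrapper like this, a test harness that sets `no_sync` no longer waits on disk throughput in `sync_all` or `sync_data`.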
2024-11-18 08:59:05 +00:00
Christian Schwarz
b6154b03f4 build(deps): bump smallvec to 1.13.2 to get UB fix (#9781)
Smallvec 1.13.2 contains [a UB
fix](https://github.com/servo/rust-smallvec/pull/345).

Upstream opened [a
request](https://github.com/rustsec/advisory-db/issues/1960)
for this in the advisory-db but it never got acted upon.

Found while working on https://github.com/neondatabase/neon/pull/9321.
2024-11-17 21:25:16 +01:00
Erik Grinaker
8880134171 Cargo.toml: upgrade tikv-jemallocator to 0.6.0 (#9779) 2024-11-17 19:52:05 +01:00
Erik Grinaker
de7e4a34ca safekeeper: send AppendResponse on segment flush (#9692)
## Problem

When processing pipelined `AppendRequest`s, we explicitly flush the WAL
every second and return an `AppendResponse`. However, the WAL is also
implicitly flushed on segment bounds, but this does not result in an
`AppendResponse`. Because of this, concurrent transactions may take up
to 1 second to commit and writes may take up to 1 second before being sent
to the pageserver.

## Summary of changes

Advance `flush_lsn` when a WAL segment is closed and flushed, and emit
an `AppendResponse`. To accommodate this, track the `flush_lsn` in
addition to the `flush_record_lsn`.
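
A rough sketch of why two flush positions are useful, under simplifying assumptions: LSNs are plain `u64` values, the segment size is 16 MiB, and the `WalWriter` and `AppendResponse` types here are hypothetical, not the safekeeper's real ones. A segment close makes all bytes up to the boundary durable, even though the last fully flushed record may end earlier.

```rust
/// Assumed WAL segment size for this sketch (16 MiB).
pub const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// Hypothetical response type carrying the durable position.
#[derive(Debug, PartialEq, Eq)]
pub struct AppendResponse {
    pub flush_lsn: u64,
}

/// Hypothetical writer state tracking both byte and record positions.
#[derive(Debug, Default)]
pub struct WalWriter {
    /// End of all bytes written so far.
    pub write_lsn: u64,
    /// End of the last complete record written (assumed <= write_lsn).
    pub write_record_lsn: u64,
    /// All bytes below this position are durable.
    pub flush_lsn: u64,
    /// End of the last complete record covered by an explicit flush.
    pub flush_record_lsn: u64,
}

impl WalWriter {
    /// Write `len` bytes whose last complete record ends at `record_end`.
    /// Crossing a segment boundary closes and flushes the old segment,
    /// so `flush_lsn` advances and a response is returned.
    pub fn write(&mut self, len: u64, record_end: u64) -> Option<AppendResponse> {
        let old_segment = self.write_lsn / WAL_SEGMENT_SIZE;
        self.write_lsn += len;
        self.write_record_lsn = record_end;
        let new_segment = self.write_lsn / WAL_SEGMENT_SIZE;
        if new_segment > old_segment {
            self.flush_lsn = new_segment * WAL_SEGMENT_SIZE;
            return Some(AppendResponse { flush_lsn: self.flush_lsn });
        }
        None
    }

    /// Explicit (for example periodic) flush of everything written.
    pub fn flush(&mut self) -> AppendResponse {
        self.flush_lsn = self.write_lsn;
        self.flush_record_lsn = self.write_record_lsn;
        AppendResponse { flush_lsn: self.flush_lsn }
    }
}
```

In this sketch, `flush_lsn` can run ahead of `flush_record_lsn` after a segment close, which is the situation that tracking both values is meant to represent.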
2024-11-17 18:19:14 +01:00
Vlad Lazar
ac689ab014 wal_decoder: rename end_lsn to next_record_lsn (#9776)
## Problem

It turns out that `WalStreamDecoder::poll_decode` returns the start LSN
of the next record and not the end LSN of the current record. They are
not always equal. For example, they're not equal when the record in
question is an XLOG SWITCH record.

## Summary of changes

Rename things to reflect that.
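
To illustrate the distinction, here is a simplified sketch. It assumes 16 MiB segments, treats LSNs as plain `u64` values, and ignores page headers and alignment padding; the function is hypothetical and is not the decoder's actual logic. After a switch record the rest of the segment is skipped, so the next record starts at the following segment boundary.

```rust
/// Assumed WAL segment size for this sketch (16 MiB).
pub const WAL_SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// Given the end LSN of the current record, compute where the next
/// record starts (hypothetical helper, simplified).
pub fn next_record_lsn(end_lsn: u64, is_xlog_switch: bool) -> u64 {
    if !is_xlog_switch {
        // Simplification: ordinary records are followed immediately.
        return end_lsn;
    }
    // A switch record skips the remainder of the current segment.
    let remainder = end_lsn % WAL_SEGMENT_SIZE;
    if remainder == 0 {
        end_lsn
    } else {
        end_lsn + (WAL_SEGMENT_SIZE - remainder)
    }
}
```

For a switch record ending at `0x0100_0028`, this yields `0x0200_0000`, so a field named `end_lsn` would be misleading for the returned value.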
2024-11-15 21:53:11 +00:00
Tristan Partin
23eabb9919 Fix PG_MAJORVERSION_NUM typo
In ea32f1d0a3, Matthias added a feature to
our extension to expose more granular wait events. However, due to the
typo, those wait events were never registered, so we used the more
generic wait events instead.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-15 15:17:23 -06:00
Vlad Lazar
2af791ba83 wal_decoder: make InterpretedWalRecord serde (#9775)
## Problem

We want to serialize interpreted records to send them over the wire from
safekeeper to pageserver.

## Summary of changes

Make `InterpretedWalRecord` ser/de. This is a temporary change to get
the bulk of the lift merged in
https://github.com/neondatabase/neon/pull/9746. For going to prod, we
don't want to use bincode since we can't evolve the schema.
Questions on serialization will be tackled separately.
2024-11-15 20:34:48 +00:00
Mikhail Kot
e12628fe93 Collect max_connections metric (#9770)
This will further allow us to expose this metric to users
2024-11-15 17:42:41 +00:00
Arpad Müller
7880c246f1 Correct mistakes in offloaded timeline retain_lsn management (#9760)
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there are more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs in the current code that could lead to
corruption in the worst case:

1. Deleting an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situation is a bit trickier than for
non-offloaded timelines which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where an `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.
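
The bookkeeping described in the three points above can be sketched as follows. This is an illustrative model only: `GcInfoSketch`, its methods other than the `insert_child` name, the `u32` timeline IDs, and the `u64` LSNs are assumptions made for the example, not the pageserver's real types.

```rust
/// Whether the child timeline that registered an entry is offloaded.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MaybeOffloaded {
    Yes,
    No,
}

/// Hypothetical parent-side record of LSNs that GC must retain.
#[derive(Debug, Default)]
pub struct GcInfoSketch {
    /// (branch point LSN, child timeline id, offloaded state)
    pub retain_lsns: Vec<(u64, u32, MaybeOffloaded)>,
}

impl GcInfoSketch {
    /// Register a child's branch point at the parent.
    pub fn insert_child(&mut self, child_id: u32, lsn: u64, offloaded: MaybeOffloaded) {
        self.retain_lsns.push((lsn, child_id, offloaded));
    }

    /// Dropping an in-memory timeline removes only its non-offloaded
    /// entries, so an offloaded registration survives the drop.
    pub fn remove_child_not_offloaded(&mut self, child_id: u32) {
        self.retain_lsns
            .retain(|&(_, id, off)| !(id == child_id && off == MaybeOffloaded::No));
    }

    /// Deleting an offloaded timeline purges its offloaded entries.
    pub fn remove_child_offloaded(&mut self, child_id: u32) {
        self.retain_lsns
            .retain(|&(_, id, off)| !(id == child_id && off == MaybeOffloaded::Yes));
    }

    /// Lowest LSN some child still needs; GC must not go past it.
    pub fn min_retain_lsn(&self) -> Option<u64> {
        self.retain_lsns.iter().map(|&(lsn, _, _)| lsn).min()
    }
}
```

In this model, offloading a child registers a `MaybeOffloaded::Yes` entry before the in-memory object is dropped, so the parent keeps the child's branch point until the offloaded timeline itself is deleted.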

Incorporates Chi's reproducer from #9753. cc
https://github.com/neondatabase/cloud/issues/20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2024-11-15 14:22:29 +01:00
John Spray
04938d9d55 tests: tolerate pageserver 500s in test_timeline_archival_chaos (#9769)
## Problem

Test exposes cases where pageserver gives 500 responses, causing
failures like
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9766/11844529470/index.html#suites/d1acc79950edeb0563fc86236c620898/3546be2ffed99ba6

## Summary of changes

- Tolerate such messages, and link an issue for cleaning up the
pageserver so that it no longer returns such 500s.
2024-11-15 13:22:05 +00:00