Commit Graph

14 Commits

Author SHA1 Message Date
Erik Grinaker
5330122049 test_runner: improve wait_until (#9936)
Improves `wait_until` by:

* Use `timeout` instead of `iterations`. This allows changing the
timeout/interval parameters independently.
* Make `timeout` and `interval` optional (default 20s and 0.5s). Most
callers don't care.
* Only output status every 1s by default, and add optional
`status_interval` parameter.
* Remove `show_intermediate_error`, this was always emitted anyway.

Most callers have been updated to use the defaults, except where they
had good reason otherwise.
2024-12-02 10:26:15 +00:00
Alexander Bayandin
8d1c44039e Python 3.11 (#9515)
## Problem

On Debian 12 (Bookworm), Python 3.11 is the latest available version.

## Summary of changes
- Update Python to 3.11 in build-tools
- Fix ruff check / format
- Fix mypy
- Use `StrEnum` instead of pair `str`, `Enum`
- Update docs
2024-11-21 16:25:31 +00:00
Tristan Partin
5bd8e2363a Enable all pyupgrade checks in ruff
This will help to keep us from using deprecated Python features going
forward.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-08 14:32:26 -05:00
Arseny Sher
17672c88ff tests: wait walreceiver on sks to be gone on 'immediate' ep restart. (#9099)
When endpoint is stopped in immediate mode and started again there is a
chance of old connection delivering some WAL to safekeepers after second
start checked need for sync-safekeepers and thus grabbed basebackup LSN.
It makes basebackup unusable, so compute panics. Avoid flakiness by
waiting for walreceivers on safekeepers to be gone in such cases. A
better way would be to bump term on safekeepers if sync-safekeepers is
skipped, but it needs more infrastructure.

ref https://github.com/neondatabase/neon/issues/9079
2024-10-01 20:54:00 +03:00
Arseny Sher
c4cdfe66ac Fix flakiness of test_timeline_copy.
Timeline might be not initialized when timeline_start_lsn is
queried. Spotted by CI.
2024-09-26 19:01:45 +03:00
Arseny Sher
11cf16e3f3 safekeeper: add term_bump endpoint.
When walproposer observes now higher term it restarts instead of
crashing whole compute with PANIC; this avoids compute crash after
term_bump call. After successfull election we're still checking
last_log_term of the highest given vote to ensure basebackup is good,
and PANIC otherwise.

It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21

ref https://github.com/neondatabase/neon/issues/8700
2024-09-06 19:13:50 +03:00
Arseny Sher
80512e2779 safekeeper: add endpoint resetting uploaded partial segment state.
Endpoint implementation sends msg to manager requesting to do the
reset. Manager stops current partial backup upload task if it exists and
performs the reset.

Also slightly tweak eviction condition: all full segments before
flush_lsn must be uploaded (and committed) and there must be only one
segment left on disk (partial). This allows to evict timelines which
started not on the first segment and didn't fill the whole
segment (previous condition wasn't good because last_removed_segno was
0).

ref https://github.com/neondatabase/neon/issues/8759
2024-09-03 17:21:36 +03:00
Arseny Sher
09362b6363 safekeeper: reorder routes and their handlers.
Routes and their handlers were in a bit different order in 1) routes
list 2) their implementation 3) python client 4) openapi spec, making
addition of new ones intimidating. Make it the same everywhere, roughly
lexicographically but preserving some of existing logic.

No functional changes.
2024-08-27 07:37:55 +03:00
Arseny Sher
d919770c55 safekeeper: add listing timelines
Adds endpoint GET /tenant/timeline listing all not deleted timelines.
2024-08-21 18:38:08 +03:00
John Spray
0159ae9536 safekeeper: eviction metrics (#8348)
## Problem

Follow up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evict/restores we are doing.

## Summary of changes

- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also ability to calculate how
many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
2024-07-11 17:05:35 +01:00
Arseny Sher
af40bf3c2e Fix term/epoch confusion in python tests.
Call epoch last_log_term and add separate term field.
2024-05-31 12:58:59 +03:00
Arseny Sher
3797566c36 safekeeper: test pull_timeline with WAL gc.
Do pull_timeline while WAL is being removed. To this end
- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.

After fixing checking for pull file status code test fails so far which is
expected.
2024-05-25 06:06:32 +03:00
Joonas Koivunen
d9dcbffac3 python: allow using allowed_errors.py (#7719)
See #7718. Fix it by renaming all `types.py` to `common_types.py`.

Additionally, add an advert for using `allowed_errors.py` to test any
added regex.
2024-05-13 15:16:23 +03:00
Heikki Linnakangas
74d09b78c7 Keep walproposer alive until shutdown checkpoint is safe on safekepeers
The walproposer pretends to be a walsender in many ways. It has a
WalSnd slot, it claims to be a walsender by calling
MarkPostmasterChildWalSender() etc. But one different to real
walsenders was that the postmaster still treated it as a bgworker
rather than a walsender. The difference is that at shutdown,
walsenders are not killed until the very end, after the checkpointer
process has written the shutdown checkpoint and exited.

As a result, the walproposer always got killed before the shutdown
checkpoint was written, so the shutdown checkpoint never made it to
safekeepers. That's fine in principle, we don't require a clean
shutdown after all. But it also feels a bit silly not to stream the
shutdown checkpoint. It could be useful for initializing hot standby
mode in a read replica, for example.

Change postmaster to treat background workers that have called
MarkPostmasterChildWalSender() as walsenders. That unfortunately
requires another small change in postgres core.

After doing that, walproposers stay alive longer. However, it also
means that the checkpointer will wait for the walproposer to switch to
WALSNDSTATE_STOPPING state, when the checkpointer sends the
PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in
walproposer to receive and handle that signal reliably. Instead, we
mark walproposer as being in WALSNDSTATE_STOPPING always.

In commit 568f91420a, I assumed that shutdown will wait for all the
remaining WAL to be streamed to safekeepers, but before this commit
that was not true, and the test became flaky. This should make it
stable again.

Some tests wrongly assumed that no WAL could have been written between
pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing
flush_ep_to_pageserver which first stops the endpoint and then waits till all
committed WAL reaches the pageserver.

In passing extract safekeeper http client to its own module.
2024-03-11 23:29:32 +04:00