Improves `wait_until` by:
* Use `timeout` instead of `iterations`. This allows changing the
timeout/interval parameters independently.
* Make `timeout` and `interval` optional (default 20s and 0.5s). Most
callers don't care.
* Only output status every 1s by default, and add optional
`status_interval` parameter.
* Remove `show_intermediate_error`, this was always emitted anyway.
Most callers have been updated to use the defaults, except where they
had good reason otherwise.
## Problem
LFC is not enabled by default in tests, but it is enabled in production.
This increases the risk of errors in the production environment, which
were not found during the routine workflow.
However, enabling LFC for all the tests may overload the disk on our
servers and increase the number of failures.
So, we try enabling LFC in one case to evaluate the possible risk.
## Summary of changes
A new environment variable, USE_LFC is introduced. If it is set to true,
LFC is enabled by default in all the tests.
In our workflow, we enable LFC for PG17, release, x86-64, and disabled
for all other combinations.
---------
Co-authored-by: Alexey Masterov <alexeymasterov@neon.tech>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
Part of #7497, closes https://github.com/neondatabase/neon/issues/8890.
## Problem
Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least lease duration
before doing first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.
## Summary of changes
- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GCBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the lsn_lease_deadline to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioned
from AttachedMulti to AttachedSingle.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
This should generally be faster when running tests, especially those
that run with higher scales.
Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvemnt is better than none.
Signed-off-by: Tristan Partin <tristan@neon.tech>
This failed once with `relation "test" does not exist` when trying to
run the query on the standby. It's possible that the standby is started
before the CREATE TABLE is processed in the pageserver, and the standby
opens up for queries before it has received the CREATE TABLE transaction
from the primary. To fix, wait for the standby to catch up to the
primary before starting to run the queries.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8025/9483658488/index.html
Don't set last-written LSN of a page when the record is replayed, only
when the page is evicted from cache. For comparison, we don't update
the last-written LSN on every page modification on the primary either,
only when the page is evicted. Do update the last-written LSN when the
page update is skipped in WAL redo, however.
In neon_get_request_lsns(), don't be surprised if the last-written LSN
is equal to the record being replayed. Use the LSN of the record being
replayed as the request LSN in that case. Add a long comment
explaining how that can happen.
In neon_wallog_page, update last-written LSN also when Shutdown has
been requested. We might still fetch and evict pages for a while,
after shutdown has been requested, so we better continue to do that
correctly.
Enable the check that we don't evict a page with zero LSN also in
standby, but make it a LOG message instead of PANIC
Fixes issue https://github.com/neondatabase/neon/issues/7791
To avoid pageserver gc'ing data needed by standby, propagate standby apply LSN
through standby -> safekeeper -> broker -> pageserver flow and hold off GC for
it. Iteration of GC resets the value to remove the horizon when standby goes
away -- pushes are assumed to happen at least once between gc iterations. As a
safety guard max allowed lag compared to normal GC horizon is hardcoded as 10GB.
Add test for the feature.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
We set it for neon replica, if primary is running.
Postgres uses this GUC at the start,
to determine if replica should wait for
RUNNING_XACTS from primary or not.
Corresponding cloud PR is
https://github.com/neondatabase/cloud/pull/10183
* Add test hot-standby replica startup.
* Extract oldest_running_xid from XlRunningXits WAL records.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complains in standby log.
Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.
https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719https://github.com/neondatabase/cloud/issues/9057
## Problem
For some tests, we override the default timeout (300s / 5m) with a larger
values like 600s / 10m or even 1800s / 30m, even if it's not required.
I've collected some statistics (for the last 60 days) for tests
duration:
| test | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_hot_standby | 9 | 2 | 2 | 5319 |
| test_import_from_vanilla | 16 | 9 | 6 | 5692 |
| test_import_from_pageserver_small | 37 | 7 | 5 | 5719 |
| test_pg_regress | 101 | 73 | 44 | 5642 |
| test_isolation | 65 | 56 | 39 | 5692 |
A couple of tests that I left with custom 600s / 10m timeout.
| test | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_gc_cutoff | 456 | 224 | 109 | 5694 |
| test_pageserver_chaos | 528 | 267 | 121 | 5712 |
## Summary of changes
- Remove `@pytest.mark.timeout` annotation from several tests
## Problem
Currently, if a user creates a role, it won't by default have any grants
applied to it. If the compute restarts, the grants get applied. This
gives a very strange UX of being able to drop roles/not have any access
to anything at first, and then once something triggers a config
application, suddenly grants are applied. This removes these grants.
This PR adds tests runs on Postgres 15 and created unified Allure report
with results for all tests.
- Split `.github/actions/allure-report` into
`.github/actions/allure-report-store` and
`.github/actions/allure-report-generate`
- Add debug or release pytest parameter for all tests (depending on
`BUILD_TYPE` env variable)
- Add Postgres version as a pytest parameter for all tests (depending on
`DEFAULT_PG_VERSION` env variable)
- Fix `test_wal_restore` and `restore_from_wal.sh` to support path with
`[`/`]` in it (fixed by applying spellcheck to the script and fixing all
warnings), `restore_from_wal_archive.sh` is deleted as unused.
- All known failures on Postgres 15 marked with xfail
Notes:
- This still needs UI support from the Console
- I've not tuned any GUCs for PostgreSQL to make this work better
- Safekeeper has gotten a tweak in which WAL is sent and how: It now
sends zero-ed WAL data from the start of the timeline's first segment up to
the first byte of the timeline to be compatible with normal PostgreSQL
WAL streaming.
- This includes the commits of #3714
Fixes one part of https://github.com/neondatabase/neon/issues/769
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>