The current test was just SQL files only, but we also want to test a
remote extension which includes a loadable library. With both extensions
we should cover a larger portion of compute_ctl's remote extension code
paths.
Fixes: https://github.com/neondatabase/neon/issues/11146
Signed-off-by: Tristan Partin <tristan@neon.tech>
We did not have any tests on fast_import binary yet.
In this PR I have introduced:
- `FastImport` class and tools for testing in python
- basic test that runs fast import against vanilla postgres and checks
that data is there
Should be merged after https://github.com/neondatabase/neon/pull/10251
## Problem
Currently, we rerun only known flaky tests. This approach was chosen to
reduce the number of tests that go unnoticed (by forcing people to take
a look at failed tests and rerun the job manually), but it has some
drawbacks:
- In PRs, people tend to push new changes without checking failed tests
(that's ok)
- In the main, tests are just restarted without checking
(understandable)
- Parametrised tests become flaky one by one, i.e. if `test[1]` is flaky
`, test[2]` is not marked as flaky automatically (which may or may not
be the case).
I suggest rerunning all failed tests to increase the stability of GitHub
jobs and using the Grafana Dashboard with flaky tests for deeper
analysis.
## Summary of changes
- Rerun all failed tests twice at max
python based regression test setup for auth_broker. This uses a http
mock for cplane as well as the JWKs url.
complications:
1. We cannot just use local_proxy binary, as that requires the
pg_session_jwt extension which we don't have available in the current
test suite
2. We cannot use just any old http mock for local_proxy, as auth_broker
requires http2 to local_proxy
as such, I used the h2 library to implement an echo server - copied from
the examples in the h2 docs.
## Problem
https://github.com/neondatabase/neon/pull/8588 implemented the mechanism
for storage controller
leadership transfers. However, there's no tests that exercise the
behaviour.
## Summary of changes
1. Teach `neon_local` how to handle multiple storage controller
instances. Each storage controller
instance gets its own subdirectory (`storage_controller_1, ...`).
`storage_controller start|stop` subcommands
have also been extended to optionally accept an instance id.
2. Add a storage controller proxy test fixture. It's a basic HTTP server
that forwards requests from pageserver
and test env to the currently configured storage controller.
3. Add a test which exercises storage controller leadership transfer.
4. Finally fix a couple bugs that the test surfaced
## Problem
Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.
Related: #6676
## Summary of changes
- Introduce async rwlocks at the scope of Tenant and Node:
- exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
Refactor tests to have less globals.
This will allow to hopefully write more complex tests for our new metric
collection requirements in #5297. Includes reverted work from #4761
related to test globals.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: MMeent <matthias@neon.tech>
## Problem
All tests have already been parametrised by Postgres version and build
type (to have them distinguishable in the Allure report), but despite
it, it's anyway required to have DEFAULT_PG_VERSION and BUILD_TYPE env
vars set to corresponding values, for example to
run`test_timeline_deletion_with_files_stuck_in_upload_queue[release-pg14-local_fs]`
test it's required to set `DEFAULT_PG_VERSION=14` and
`BUILD_TYPE=release`.
This PR makes the test framework pick up parameters from the test name
itself.
## Summary of changes
- Postgres version and build type related fixtures now are
function-scoped (instead of being sessions scoped before)
- Deprecate `--pg-version` argument in favour of DEFAULT_PG_VERSION env
variable (it's easier to parse)
- GitHub autocomment now includes only one command with all the failed
tests + runs them in parallel
This PR adds tests runs on Postgres 15 and created unified Allure report
with results for all tests.
- Split `.github/actions/allure-report` into
`.github/actions/allure-report-store` and
`.github/actions/allure-report-generate`
- Add debug or release pytest parameter for all tests (depending on
`BUILD_TYPE` env variable)
- Add Postgres version as a pytest parameter for all tests (depending on
`DEFAULT_PG_VERSION` env variable)
- Fix `test_wal_restore` and `restore_from_wal.sh` to support path with
`[`/`]` in it (fixed by applying spellcheck to the script and fixing all
warnings), `restore_from_wal_archive.sh` is deleted as unused.
- All known failures on Postgres 15 marked with xfail
This PR adds a plugin that automatically reruns (up to 3 times) flaky
tests. Internally, it uses data from `TEST_RESULT_CONNSTR` database and
`pytest-rerunfailures` plugin.
As the first approximation we consider the test flaky if it has failed on
the main branch in the last 10 days.
Flaky tests are fetched by `scripts/flaky_tests.py` script (it's
possible to use it in a standalone mode to learn which tests are flaky),
stored to a JSON file, and then the file is passed to the pytest plugin.
We were getting a warning like this from the pg_regress tests:
=================== warnings summary ===================
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: fixtures.pg_stats
self.import_plugin(import_spec)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
------------------ Benchmark results -------------------
To fix, reorder the imports in conftest.py. I'm not sure what exactly
the problem was or why the order matters, but the warning is gone and
that's good enough for me.
- The 'pageserver' fixture now sets up the repository and starts up
the Page Server automatically. In other words, the 'pageserver'
fixture provides a Page Server that's up and running and ready to
use in tests.
- The 'pageserver' fixture now also creates a branch called 'empty',
right after initializing the repository. By convention, all the
tests start by createing a new branch off 'empty' for the test. This
allows running all the tests against the same Page Server
concurrently. (I haven't tested that though. pytest doensn't
provide an option to run tests in parallel but there are extensions
for that.)
- Remove the 'zen_simple' fixture. Now that 'pageserver' provides
server that's up and running, it's pretty simple to use the
'pageserver' and 'postgres' fixtures directly.
- Don't assume host name or ports in the tests. They now use the
fields in the fixtures for that. That allows assigning the ports
dynamically, making it possible to run multiple page servers in
parallel, or running the tests in parallel with another page
server. This commit still hard codes the Page Server's port in the
fixture, though, so more work is needed to actually make it
possible.
- I made some changes to the 'postgres' fixture in commit 532918e13d,
which broke the other tests. Fix them.
- Divide the tests into two "batches" of roughly equal runtime, which
can be run in parallel
- Merge the 'test_file' and 'test_filter' options in CircleCI config
into one 'test_selection' option, for simplicity.