This PR adds tests runs on Postgres 15 and created unified Allure report
with results for all tests.
- Split `.github/actions/allure-report` into
`.github/actions/allure-report-store` and
`.github/actions/allure-report-generate`
- Add debug or release pytest parameter for all tests (depending on
`BUILD_TYPE` env variable)
- Add Postgres version as a pytest parameter for all tests (depending on
`DEFAULT_PG_VERSION` env variable)
- Fix `test_wal_restore` and `restore_from_wal.sh` to support path with
`[`/`]` in it (fixed by applying spellcheck to the script and fixing all
warnings), `restore_from_wal_archive.sh` is deleted as unused.
- All known failures on Postgres 15 marked with xfail
- Increase `connect_timeout` to 30s, which should be enough for
most of the cases
- If the script cannot connect to the DB (or any other
`psycopg2.OperationalError` occur) — do not fail the script, log
the error and proceed. Problems with fetching flaky tests shouldn't
block the PR
This PR adds posting a comment with test results. Each workflow run
updates the comment with new results.
The layout and the information that we post can be changed to our needs,
right now, it contains failed tests and test which changes status after
rerun (i.e. flaky tests)
This PR adds a plugin that automatically reruns (up to 3 times) flaky
tests. Internally, it uses data from `TEST_RESULT_CONNSTR` database and
`pytest-rerunfailures` plugin.
As the first approximation we consider the test flaky if it has failed on
the main branch in the last 10 days.
Flaky tests are fetched by `scripts/flaky_tests.py` script (it's
possible to use it in a standalone mode to learn which tests are flaky),
stored to a JSON file, and then the file is passed to the pytest plugin.
This script can be used to remove tenant directories on safekeepers for
projects which do not longer exist (deleted in the console).
To run this script you need to upload it to safekeeper (i.e. with SSH),
and run it with python3. Ansible can be used to run this script on
multiple safekeepers.
Fixes https://github.com/neondatabase/cloud/issues/3356
- add parse_query_param()
- use Cow<> where possible
- move param parsing code to utils::http::request
This was originally PR https://github.com/neondatabase/neon/pull/3502
which targeted a different branch.
closes #3510
For use in production in case on-demand download turns out to be
problematic during tenant_attach, or when we eventually introduce layer
eviction.
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
The code in this change was extracted from #2595 (Heikki’s on-demand
download draft PR).
High-Level Changes
- New RemoteLayer Type
- On-Demand Download As An Effect Of Page Reconstruction
- Breaking Semantics For Physical Size Metrics
There are several follow-up work items planned.
Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029
closes https://github.com/neondatabase/neon/pull/3013
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
New RemoteLayer Type
====================
Instead of downloading all layers during tenant attach, we create
RemoteLayer instances for each of them and add them to the layer map.
On-Demand Download As An Effect Of Page Reconstruction
======================================================
At the heart of pageserver is Timeline::get_reconstruct_data(). It
traverses the layer map until it has collected all the data it needs to
produce the page image. Most code in the code base uses it, though many
layers of indirection.
Before this patch, the function would use synchronous filesystem IO to
load data from disk-resident layer files if the data was not cached.
That is not possible with RemoteLayer, because the layer file has not
been downloaded yet. So, we do the download when get_reconstruct_data
gets there, i.e., “on demand”.
The mechanics of how the download is done are rather involved, because
of the infamous async-sync-async sandwich problem that plagues the async
Rust world. We use the new PageReconstructResult type to work around
this. Its introduction is the cause for a good amount of code churn in
this patch. Refer to the block comment on `with_ondemand_download()`
for details.
Breaking Semantics For Physical Size Metrics
============================================
We rename prometheus metric pageserver_{current,resident}_physical_size to
reflect what this metric actually represents with on-demand download.
This intentionally BREAKS existing grafana dashboard and the cost model data
pipeline. Breaking is desirable because the meaning of this metrics has changed
with on-demand download. See
https://docs.google.com/document/d/12AFpvKY-7FZdR5a4CaD6Ir_rI3QokdCLSPJ6upHxJBo/edit#
for how we will handle this breakage.
Likewise, we rename the new billing_metrics’s PhysicalSize => ResidentSize.
This is not yet used anywhere, so, this is not a breaking change.
There is still a field called TimelineInfo::current_physical_size. It
is now the sum of the layer sizes in layer map, regardless of whether
local or remote. To compute that sum, we added a new trait method
PersistentLayer::file_size().
When updating the Python tests, we got rid of
current_physical_size_non_incremental. An earlier commit removed it from
the OpenAPI spec already, so this is not a breaking change.
test_timeline_size.py has grown additional assertions on the
resident_physical_size metric.
Closes https://github.com/neondatabase/neon/issues/2697
Example:
https://github.com/neondatabase/neon/actions/runs/3416774593/jobs/5688394855
Adds a set of tests on the storage Docker images before they are pushed
to the public registries:
* tests that pageserver binary has the correct version string (other
binaries are built with the same library, so it should be enough to test
one)
* tests that the compose file set-up works and all components are able
to start and perform a single SQL query (CREATE TABLE)
`test_tenant_relocation` ends up starting a temporary postgres instance with a fixed port. the change makes the port configurable at scripts/export_import_between_pageservers.py and uses that in test_tenant_relocation.
The 'local' part was always filled in, so that was easy to merge into
into the TimelineInfo itself. 'remote' only contained two fields,
'remote_consistent_lsn' and 'awaits_download'. I made
'remote_consistent_lsn' an optional field, and 'awaits_download' is now
false if the timeline is not present remotely.
However, I kept stub versions of the 'local' and 'remote' structs for
backwards-compatibility, with a few fields that are actively used by
the control plane. They just duplicate the fields from TimelineInfo
now. They can be removed later, once the control plane has been
updated to use the new fields.
Fixes#1873: previously any run of `make` caused the `postgres-v15-headers`
target to build. It copied a bunch of headers via `install -C`. Unfortunately,
some origins were symlinks in the `./pg_install/build` directory pointing
inside `./vendor/postgres-v15` (e.g. `pg_config_os.h` pointing to `linux.h`).
GNU coreutils' `install` ignores the `-C` key for non-regular files and
always overwrites the destination if the origin is a symlink. That in turn
made Cargo rebuild the `postgres_ffi` crate and all its dependencies because
it thinks that Postgres headers changed, even if they did not. That was slow.
Now we use a custom script that wraps the `install` program. It handles one
specific case and makes sure individual headers are never copied if their
content did not change. Hence, `postgres_ffi` is not rebuilt unless there were
some changes to the C code.
One may still have slow incremental single-threaded builds because Postgres
Makefiles spawn about 2800 sub-makes even if no files have been changed.
A no-op build takes "only" 3-4 seconds on my machine now when run with `-j30`,
and 20 seconds when run with `-j1`.
This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Mainly because it has better support for installing the packages from
different python versions.
It also has better dependency resolver than Pipenv. And supports modern
standard for python dependency management. This includes usage of
pyproject.toml for project specific configuration instead of per
tool conf files. See following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/https://www.python.org/dev/peps/pep-0518/
tests are based on self-hosted runner which is physically close
to our staging deployment in aws, currently tests consist of
various configurations of pgbenchi runs.
Also these changes rework benchmark fixture by removing globals and
allowing to collect reports with desired metrics and dump them to json
for further analysis. This is also applicable to usual performance tests
which use local zenith binaries.