At the time of writing, the only logical difference between Attach and Load is that Attach learns the list of timelines by querying the remote storage, whereas Load learns it by listing the timelines directory of the tenant. This patch restructures the code such that Attach first prepares the on-disk state and then calls into the same `load()` routine that is used by Load.

Further, this patch provides the following fixes & improvements to Attach:

1. Make Attach durable before acknowledging it to the management API client. Before this change, we would acknowledge after just creating the in-memory tenant. In the event of a crash before creating the tenant directory and fsyncing the attaching marker, the pageserver would come up without the tenant present (404), even though we acknowledged success to the client.

2. Simplify the resume logic if we crash during Attach. Before this patch, if we crashed during Attach with some timelines downloaded and others not downloaded, we would combine existing metadata files with remote ones one-by-one to figure out what's missing. That was necessary before on-demand download because we were downloading layer files as part of Attach. However, with on-demand download, Attach only downloads & writes the timeline metadata files.

   After this patch, when we crash during Attach, we blow away the tenant's directory while leaving its attach marker file in place. Then, we start over. IMO this is significantly easier to reason about compared to what we had before.

   Note that we were losing the work for the downloads even before this change, so that's not a regression (the old reconcile_with_remote would still need to download the `IndexPart`s when resuming Attach after a crash).

If we want to improve on this in the future, I think the first order of business will be to avoid re-downloading the `IndexPart`s and the initial `list_remote_timelines()`.
However, given that crashes should be rare, and attach events also, I think the number one priority with the Attach code should be to make it as simple as possible.

For (2), I changed the location of the attach marker file to be outside the tenant directory, so that we can use standard functions for removing the tenant directory. I even wrote a migration function for it, although in retrospect, I think it's quite unlikely that any deployed tenants are in attaching state. But oh well, now the code is there, and it even has unit tests. We can delete the migration code once we've successfully rolled it out to all regions.

The remaining wrinkle with this change is that Attach needs to hint the downloaded `IndexPart`s to `load()` so that it doesn't download them twice during Attach, which would be wasteful. The mechanism for this is the new `TenantLoadReason` and `TimelineLoadReason`. We could eliminate this particular case by on-demand downloading the metadata. However, that might open up another can of worms which I'd like to avoid. If we ever want to go that route, I suggest we start tracking the attachment state of a timeline more formally, e.g., in a `timelines.json` file.

This is PR https://github.com/neondatabase/neon/pull/3466
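The crash-resume behaviour described above can be modelled in a few lines. The following is only a simplified Python sketch of the idea, not the pageserver's Rust code: the marker file name (`<tenant_id>.attaching`), the `metadata` file name, the point at which the marker is removed, and the functions `attach`, `resume_after_crash` and `fsync_path` are all assumptions made for illustration. It assumes a POSIX system, where a directory can be opened and fsynced.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Dict


def fsync_path(path: Path) -> None:
    """Flush a file or directory entry to stable storage (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def attach(
    tenants_dir: Path,
    tenant_id: str,
    fetch_metadata: Callable[[], Dict[str, bytes]],
) -> Dict[str, bytes]:
    """Hypothetical model of Attach: durable marker, clean slate, download."""
    # The marker lives next to, not inside, the tenant directory, so the
    # tenant directory can be removed wholesale (hypothetical naming).
    marker = tenants_dir / f"{tenant_id}.attaching"
    tenant_dir = tenants_dir / tenant_id

    # 1. Make the attach durable. Only after this point would the request
    #    be acknowledged to the client.
    marker.touch()
    fsync_path(marker)
    fsync_path(tenants_dir)

    # 2. Throw away whatever a crashed earlier attempt left behind.
    shutil.rmtree(tenant_dir, ignore_errors=True)

    # 3. Download and write only the per-timeline metadata.
    timelines_dir = tenant_dir / "timelines"
    timelines_dir.mkdir(parents=True)
    downloaded = fetch_metadata()
    for timeline_id, metadata in downloaded.items():
        timeline_dir = timelines_dir / timeline_id
        timeline_dir.mkdir()
        (timeline_dir / "metadata").write_bytes(metadata)

    # 4. The on-disk state is complete, so the marker can go (assumed timing).
    marker.unlink()
    fsync_path(tenants_dir)

    # Returned so a later load step does not have to download it again.
    return downloaded


def resume_after_crash(
    tenants_dir: Path, fetch_metadata: Callable[[], Dict[str, bytes]]
) -> None:
    """On startup, restart every attach whose marker is still present."""
    for marker in sorted(tenants_dir.glob("*.attaching")):
        attach(tenants_dir, marker.stem, fetch_metadata)
```

Because a leftover marker always means "start over from an empty tenant directory", there is no per-timeline reconciliation to reason about; the cost is re-downloading the metadata after a crash.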
Neon test runner
This directory contains integration tests.
Prerequisites:
- Correctly configured Python, see /docs/sourcetree.md
- Neon and Postgres binaries
  - See the root README.md for build directions.
    If you want to run tests that use test-only APIs, add --features testing to the Rust build commands. For convenience, the repository cargo config contains a build_testing alias that serves as a subcommand adding the required feature flags. Usage example: cargo build_testing --release is equivalent to cargo build --features testing --release
  - Tests can be run from the git tree; or see the environment variables below to run from other directories.
- The neon git repo, including the postgres submodule (for some tests, e.g. pg_regress)
Test Organization
Regression tests are in the 'regress' directory. They can be run in
parallel to minimize total runtime. Most regression tests set up their
own environment with their own pageservers and safekeepers (but see
TEST_SHARED_FIXTURES).
'pg_clients' contains tests for connecting with various client libraries. Each client test uses a Dockerfile that pulls an image that contains the client, and connects to PostgreSQL with it. The client tests can be run against an existing PostgreSQL or Neon installation.
'performance' contains performance regression tests. Each test exercises a particular scenario or workload, and outputs measurements. They should be run serially, to avoid the tests interfering with the performance of each other. Some performance tests set up their own Neon environment, while others can be run against an existing PostgreSQL or Neon environment.
Running the tests
There is a wrapper script to invoke pytest: ./scripts/pytest.
It accepts all the arguments that are accepted by pytest.
Depending on your installation options pytest might be invoked directly.
Test state (postgres data, pageserver state, and log files) will
be stored under a directory test_output.
You can run all the tests with:
./scripts/pytest
If you want to run all the tests in a particular file:
./scripts/pytest test_pgbench.py
If you want to run all tests that have the string "bench" in their names:
./scripts/pytest -k bench
To run tests in parallel, we use the pytest-xdist plugin. By default, everything runs single-threaded. The number of workers can be specified with the -n argument:
./scripts/pytest -n4
By default, performance tests are excluded. To run them, explicitly pass the performance test directory to the script:
./scripts/pytest test_runner/performance
Useful environment variables:
NEON_BIN: The directory where neon binaries can be found.
POSTGRES_DISTRIB_DIR: The directory where postgres distribution can be found.
Since pageserver supports several postgres versions, POSTGRES_DISTRIB_DIR must contain
a subdirectory for each version with naming convention v{PG_VERSION}/.
Inside that dir, a bin/postgres binary should be present.
DEFAULT_PG_VERSION: The version of Postgres to use.
This is used to construct the full path to the postgres binaries.
The format is the 2-digit major version number, e.g. DEFAULT_PG_VERSION="14"
TEST_OUTPUT: Set the directory where test state and test output files
should go.
TEST_SHARED_FIXTURES: Try to re-use a single pageserver for all the tests.
NEON_PAGESERVER_OVERRIDES: add a ;-separated set of configs that will be passed as
RUST_LOG: logging configuration to pass into Neon CLI
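To make the directory convention for POSTGRES_DISTRIB_DIR and DEFAULT_PG_VERSION concrete, here is a small sketch of how the path to the postgres binary could be derived from the two variables. The function name `postgres_binary` is hypothetical and this is not the test runner's own code; it only restates the v{PG_VERSION}/bin/postgres layout described above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def postgres_binary(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Build the expected path of the postgres binary (illustrative helper)."""
    env = os.environ if environ is None else environ
    distrib_dir = Path(env["POSTGRES_DISTRIB_DIR"])
    version = env["DEFAULT_PG_VERSION"]  # 2-digit major version, e.g. "14"
    # Each supported version lives in its own v{PG_VERSION}/ subdirectory.
    return distrib_dir / f"v{version}" / "bin" / "postgres"
```

For example, with POSTGRES_DISTRIB_DIR=/opt/pg and DEFAULT_PG_VERSION=14, the sketch yields /opt/pg/v14/bin/postgres.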
Useful parameters and commands:
--pageserver-config-override=${value} -c values to pass into pageserver through neon_local cli
--preserve-database-files to preserve pageserver (layer) and safekeeper (segment) timeline files on disk
after running a test suite. Such files might be large, so they are removed by default, but they can be useful for debugging or for creating svg images of layer file contents.
Let stdout, stderr and INFO log messages go to the terminal instead of capturing them:
./scripts/pytest -s --log-cli-level=INFO ...
(Note many tests capture subprocess outputs separately, so this may not
show much.)
Exit after the first test failure:
./scripts/pytest -x ...
(there are many more pytest options; run pytest -h to see them.)
Writing a test
Every test needs a Neon Environment, or NeonEnv to operate in. A Neon Environment is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and compute Postgres nodes. The connections between them can be configured to use JWT authentication tokens, and some other configuration options can be tweaked too.
The easiest way to get access to a Neon Environment is by using the neon_simple_env
fixture. The 'simple' env may be shared across multiple tests, so don't shut down the nodes
or make other destructive changes in that environment. Also don't assume that
there are no tenants or branches or data in the cluster. For convenience, there is a
branch called empty, though. The convention is to create a test-specific branch of
that and load any test data there, instead of the 'main' branch.
For more complicated cases, you can build a custom Neon Environment, with the neon_env_builder
fixture:
from fixtures.neon_fixtures import NeonEnvBuilder


def test_foobar(neon_env_builder: NeonEnvBuilder):
    # Prescribe the environment.
    # We want to have 3 safekeeper nodes, and use JWT authentication in the
    # connections to the page server
    neon_env_builder.num_safekeepers = 3
    neon_env_builder.set_pageserver_auth(True)

    # Now create the environment. This initializes the repository, and starts
    # up the page server and the safekeepers
    env = neon_env_builder.init_start()

    # Run the test
    ...
For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html
At the end of a test, all the nodes in the environment are automatically stopped, so you
don't need to worry about cleaning up. Logs and test data are preserved for analysis
in a directory under ../test_output/<testname>
Before submitting a patch
Ensure that you pass all obligatory checks.
Also consider:
- Writing a couple of docstrings to clarify the reasoning behind a new test.
- Adding more type hints to your code to avoid Any, especially:
  - For fixture parameters, they are not automatically deduced.
  - For function arguments and return values.
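As an illustration of the type-hint advice, the sketch below annotates both a helper and a test-style function. `FakeEnv`, `branch_names` and `check_branch_names` are hypothetical stand-ins invented for this example; in a real test the `env` argument would be injected by a pytest fixture, which is exactly the case where a type checker cannot deduce the type without an explicit annotation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEnv:
    """Hypothetical stand-in for an object provided by a fixture."""

    branches: List[str]


def branch_names(env: FakeEnv, prefix: str) -> List[str]:
    """Annotated arguments and return value: nothing here is typed as Any."""
    return [name for name in env.branches if name.startswith(prefix)]


def check_branch_names(env: FakeEnv) -> None:
    # Without the ": FakeEnv" annotation, a type checker would treat `env`
    # as Any and could not catch a misspelled attribute such as `env.branchs`.
    assert branch_names(env, "test_") == ["test_foo"]
```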