neon/test_runner
Christian Schwarz 2d5a8462c8 add async walredo mode (disabled-by-default, opt-in via config) (#6548)
Before this PR, the `nix::poll::poll` call would stall the executor.

This PR refactors the `walredo::process` module to allow for different
implementations, and adds a new `async` implementation which uses
`tokio::process::ChildStd{in,out}` for IPC.

The `sync` variant remains the default for now; we'll do more testing in
staging and gradual rollout to prod using the config variable.

Performance
-----------

I updated `bench_walredo.rs`, demonstrating that a single `async`-based
walredo manager used by N=1...128 tokio tasks has lower latency and
higher throughput.

I further did manual less-micro-benchmarking in the real pageserver
binary.
Methodology & results are published here:

https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4

tl;dr:
- use pagebench against a pageserver patched to answer getpage request &
small-enough working set to fit into PS PageCache / kernel page cache.
- compare knee in the latency/throughput curve
    - N tenants, each with 1 pagebench client
    - sync better throughput at N < 30, async better at higher N
    - async has generally noticeable, but not much worse, p99.X tail latencies
- eyeballing CPU efficiency in htop, `async` seems significantly more
CPU efficient at ca N=[0.5*ncpus, 1.5*ncpus], worse than `sync` outside
of that band

Mental Model For Walredo & Scheduler Interactions
-------------------------------------------------

Walredo is CPU-/DRAM-only work.
This means that as soon as the Pageserver writes to the pipe, the
walredo process becomes runnable.

To the Linux kernel scheduler, the `$ncpus` executor threads and the
walredo process thread are just `struct task_struct`, and it will divide
CPU time fairly among them.

In `sync` mode, there are always `$ncpus` runnable `struct task_struct`
because the executor thread blocks while `walredo` runs, and the
executor thread becomes runnable when the `walredo` process is done
handling the request.
In `async` mode, the executor threads remain runnable unless there are
no more runnable tokio tasks, which is unlikely in a production
pageserver.

The above means that in `sync` mode, there is an implicit concurrency
limit on concurrent walredo requests (`$num_runtimes *
$num_executor_threads_per_runtime`).
And executor threads do not compete in the Linux kernel scheduler for
CPU time, due to the blocked-runnable-ping-pong.
In `async` mode, there is no concurrency limit, and the walredo tasks
compete with the executor threads for CPU time in the kernel scheduler.

If we're not CPU-bound, `async` has a pipelining and hence throughput
advantage over `sync` because one executor thread can continue
processing requests while a walredo request is in flight.
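
A toy model of that pipelining effect, written with Python's asyncio
rather than tokio, so it is an illustration only and not the code in this
PR. It assumes a single executor thread; `walredo_sync`, `walredo_async`
and `peak_in_flight` are hypothetical names, and `asyncio.sleep(0)` stands
in for awaiting the reply on the pipe.

```python
import asyncio


class InFlight:
    """Tracks how many walredo requests are outstanding at once."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0

    def enter(self) -> None:
        self.current += 1
        self.peak = max(self.peak, self.current)

    def leave(self) -> None:
        self.current -= 1


def walredo_sync(stats: InFlight) -> None:
    # Blocking call: the executor thread cannot run any other task
    # until the request has completed.
    stats.enter()
    stats.leave()


async def walredo_async(stats: InFlight) -> None:
    stats.enter()
    # Yield to the executor while the request is in flight, so that
    # other tasks can submit their own requests in the meantime.
    await asyncio.sleep(0)
    stats.leave()


async def peak_in_flight(n_tasks: int, use_async: bool) -> int:
    stats = InFlight()

    async def task() -> None:
        if use_async:
            await walredo_async(stats)
        else:
            walredo_sync(stats)

    await asyncio.gather(*(task() for _ in range(n_tasks)))
    return stats.peak


if __name__ == "__main__":
    print("sync peak: ", asyncio.run(peak_in_flight(8, use_async=False)))
    print("async peak:", asyncio.run(peak_in_flight(8, use_async=True)))
```

In this model the blocking variant never has more than one request
outstanding per executor thread, while the awaiting variant lets every
task submit its request before the first reply is consumed.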

If we're CPU-bound, under a fair CPU scheduler, the *fixed* number of
executor threads has to share CPU time with the aggregate of walredo
processes.
It's trivial to reason about this in `sync` mode due to the
blocked-runnable-ping-pong.
In `async` mode, at 100% CPU, the system arrives at some (potentially
sub-optimal) equilibrium where the executor threads get just enough CPU
time to fill up the remaining CPU time with runnable walredo processes.

Why `async` mode Doesn't Limit Walredo Concurrency
--------------------------------------------------

To control that equilibrium in `async` mode, one may add a tokio
semaphore to limit the number of in-flight walredo requests.
However, the placement of such a semaphore is non-trivial because it
means that tasks queuing up behind it hold on to their request-scoped
allocations.
In the case of walredo, that might be the entire reconstruct data.
We don't limit the number of total inflight Timeline::get (we only
throttle admission).
So, that queue might lead to an OOM.

The alternative is to acquire the semaphore permit *before* collecting
reconstruct data.
However, what if we need to on-demand download?

A combination of semaphores might help: one for reconstruct data, one
for walredo.
The reconstruct data semaphore permit is dropped after acquiring the
walredo semaphore permit.
This scheme effectively enables both a limit on in-flight reconstruct
data and walredo concurrency.
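
A minimal sketch of that two-semaphore hand-off, again in Python asyncio
rather than tokio and not part of this PR. The names `get_page`,
`simulate` and `Peaks` are hypothetical, and the `asyncio.sleep(0)` calls
stand in for collecting reconstruct data and for the walredo request.
Cancellation handling is omitted for brevity.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Peaks:
    holding_data: int = 0      # requests currently holding reconstruct data
    in_walredo: int = 0        # requests currently inside walredo
    max_holding_data: int = 0
    max_in_walredo: int = 0


async def get_page(
    reconstruct_sem: asyncio.Semaphore,
    walredo_sem: asyncio.Semaphore,
    peaks: Peaks,
) -> None:
    # Take the reconstruct permit *before* allocating reconstruct data.
    await reconstruct_sem.acquire()
    peaks.holding_data += 1
    peaks.max_holding_data = max(peaks.max_holding_data, peaks.holding_data)
    await asyncio.sleep(0)  # stand-in for collecting reconstruct data

    # Queue for walredo while still holding the reconstruct permit,
    # and drop the reconstruct permit only once the walredo permit is held.
    await walredo_sem.acquire()
    reconstruct_sem.release()
    try:
        peaks.in_walredo += 1
        peaks.max_in_walredo = max(peaks.max_in_walredo, peaks.in_walredo)
        await asyncio.sleep(0)  # stand-in for the walredo request
    finally:
        peaks.in_walredo -= 1
        peaks.holding_data -= 1
        walredo_sem.release()


async def simulate(n_requests: int, reconstruct_permits: int, walredo_permits: int) -> Peaks:
    reconstruct_sem = asyncio.Semaphore(reconstruct_permits)
    walredo_sem = asyncio.Semaphore(walredo_permits)
    peaks = Peaks()
    await asyncio.gather(
        *(get_page(reconstruct_sem, walredo_sem, peaks) for _ in range(n_requests))
    )
    return peaks


if __name__ == "__main__":
    print(asyncio.run(simulate(50, reconstruct_permits=4, walredo_permits=2)))
```

With this ordering, the number of requests holding reconstruct data is
bounded by the sum of both permit counts, and walredo concurrency is
bounded by the walredo permit count.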

However, sizing the number of permits for the semaphores is tricky:
- Reconstruct data retrieval is a mix of disk IO and CPU work.
- If we need to do on-demand downloads, it's network IO + disk IO + CPU
work.
- At this time, we have no good data on how the wall clock time is
distributed.

It turns out that, in my benchmarking, the system worked fine without a
semaphore. So, we're shipping async walredo without one for now.

Future Work
-----------

We will do more testing of `async` mode and gradual rollout to prod
using the config flag.
Once that is done, we'll remove `sync` mode to avoid the temporary code
duplication introduced by this PR.
The flag will be removed.

The `wait()` for the child process to exit is still synchronous; the
comment [here](
655d3b6468/pageserver/src/walredo.rs (L294-L306))
is still a valid argument in favor of that.

The `sync` mode had another implicit advantage: from tokio's
perspective, the calling task was using up coop budget.
But with `async` mode, that's no longer the case -- to tokio, the writes
to the child process pipe look like IO.
We could/should inform tokio about the CPU time budget consumed by the
task to achieve fairness similar to `sync`.
However, the [runtime function for this is
`tokio_unstable`](https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html).


Refs
----

refs #6628 
refs https://github.com/neondatabase/neon/issues/2975
2024-04-15 22:14:42 +02:00

Neon test runner

This directory contains integration tests.

Prerequisites:

  • Correctly configured Python, see /docs/sourcetree.md
  • Neon and Postgres binaries
    • See the root README.md for build directions. If you want to run tests that use test-only APIs, add --features testing to the Rust build commands. For convenience, the repository's cargo config contains a build_testing alias that serves as a subcommand and adds the required feature flags. Usage example: cargo build_testing --release is equivalent to cargo build --features testing --release
    • Tests can be run from the git tree; or see the environment variables below to run from other directories.
  • The neon git repo, including the postgres submodule (for some tests, e.g. pg_regress)

Test Organization

Regression tests are in the 'regress' directory. They can be run in parallel to minimize total runtime. Most regression tests set up their own environment with its own pageservers and safekeepers (but see TEST_SHARED_FIXTURES).

'pg_clients' contains tests for connecting with various client libraries. Each client test uses a Dockerfile that pulls an image that contains the client, and connects to PostgreSQL with it. The client tests can be run against an existing PostgreSQL or Neon installation.

'performance' contains performance regression tests. Each test exercises a particular scenario or workload, and outputs measurements. They should be run serially, to avoid the tests interfering with the performance of each other. Some performance tests set up their own Neon environment, while others can be run against an existing PostgreSQL or Neon environment.

Running the tests

There is a wrapper script to invoke pytest: ./scripts/pytest. It accepts all the arguments that are accepted by pytest. Depending on your installation options pytest might be invoked directly.

Test state (postgres data, pageserver state, and log files) will be stored under a directory test_output.

You can run all the tests with:

./scripts/pytest

If you want to run all the tests in a particular file:

./scripts/pytest test_pgbench.py

If you want to run all tests that have the string "bench" in their names:

./scripts/pytest -k bench

To run tests in parallel, we use the pytest-xdist plugin. By default, everything runs single-threaded. The number of workers can be specified with the -n argument:

./scripts/pytest -n4

By default, performance tests are excluded. To run them, explicitly pass the performance test directory to the script:

./scripts/pytest test_runner/performance

Useful environment variables:

  • NEON_BIN: The directory where neon binaries can be found.
  • POSTGRES_DISTRIB_DIR: The directory where the postgres distribution can be found. Since pageserver supports several postgres versions, POSTGRES_DISTRIB_DIR must contain a subdirectory for each version, named v{PG_VERSION}/. Inside that directory, a bin/postgres binary should be present.
  • DEFAULT_PG_VERSION: The version of Postgres to use. This is used to construct the full path to the postgres binaries. The format is the 2-digit major version number, e.g. DEFAULT_PG_VERSION="14". Alternatively, you can use the --pg-version argument.
  • TEST_OUTPUT: Set the directory where test state and test output files should go.
  • TEST_SHARED_FIXTURES: Try to re-use a single pageserver for all the tests.
  • NEON_PAGESERVER_OVERRIDES: A ;-separated set of pageserver config overrides.
  • RUST_LOG: Logging configuration to pass into the Neon CLI.
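
As a rough illustration of how POSTGRES_DISTRIB_DIR and DEFAULT_PG_VERSION combine into a binary path under the v{PG_VERSION}/ convention, here is a small sketch. The helper name postgres_binary is hypothetical and is not taken from the test fixtures; it only mirrors the layout described above.

```python
from pathlib import PurePosixPath


def postgres_binary(distrib_dir: str, pg_version: str) -> PurePosixPath:
    """Build the expected path of the postgres binary for one major version.

    Hypothetical helper: it follows the documented layout
    <POSTGRES_DISTRIB_DIR>/v<PG_VERSION>/bin/postgres and does not check
    that the file actually exists.
    """
    if not pg_version.isdigit():
        raise ValueError(f"expected a major version number, got {pg_version!r}")
    return PurePosixPath(distrib_dir) / f"v{pg_version}" / "bin" / "postgres"


if __name__ == "__main__":
    import os

    # Both variables must be set for this demo; no defaults are assumed.
    print(postgres_binary(os.environ["POSTGRES_DISTRIB_DIR"],
                          os.environ["DEFAULT_PG_VERSION"]))
```

Checking that this path exists before starting a long test run can help catch a misconfigured environment early.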

Useful parameters and commands:

--pageserver-config-override=${value}: -c values to pass into the pageserver through the neon_local CLI.

--preserve-database-files: preserve pageserver (layer) and safekeeper (segment) timeline files on disk after running a test suite. Such files might be large, so they are removed by default, but they can be useful for debugging or for creating svg images of layer file contents.

Let stdout, stderr and INFO log messages go to the terminal instead of capturing them: ./scripts/pytest -s --log-cli-level=INFO ... (Note many tests capture subprocess outputs separately, so this may not show much.)

Exit after the first test failure: ./scripts/pytest -x ... (there are many more pytest options; run pytest -h to see them.)

Writing a test

Every test needs a Neon Environment, or NeonEnv to operate in. A Neon Environment is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and compute Postgres nodes. The connections between them can be configured to use JWT authentication tokens, and some other configuration options can be tweaked too.

The easiest way to get access to a Neon Environment is by using the neon_simple_env fixture. The 'simple' env may be shared across multiple tests, so don't shut down the nodes or make other destructive changes in that environment. Also don't assume that there are no tenants or branches or data in the cluster. For convenience, there is a branch called empty, though. The convention is to create a test-specific branch of that and load any test data there, instead of the 'main' branch.

For more complicated cases, you can build a custom Neon Environment with the neon_env_builder fixture:

from fixtures.neon_fixtures import NeonEnvBuilder


def test_foobar(neon_env_builder: NeonEnvBuilder):
    # Prescribe the environment.
    # We want to have 3 safekeeper nodes, and use JWT authentication in the
    # connections to the page server
    neon_env_builder.num_safekeepers = 3
    neon_env_builder.set_pageserver_auth(True)

    # Now create the environment. This initializes the repository, and starts
    # up the page server and the safekeepers
    env = neon_env_builder.init_start()

    # Run the test
    ...

For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html

At the end of a test, all the nodes in the environment are automatically stopped, so you don't need to worry about cleaning up. Logs and test data are preserved for analysis in a directory under ../test_output/<testname>

Before submitting a patch

Ensure that you pass all obligatory checks.

Also consider:

  • Writing a couple of docstrings to clarify the reasoning behind a new test.
  • Adding more type hints to your code to avoid Any, especially:
    • For fixture parameters, whose types are not automatically deduced.
    • For function arguments and return values.