Files
neon/test_runner
Christian Schwarz c47c5f4ace fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386)
# Refs

- fixes https://github.com/neondatabase/neon/issues/10309
- fixup of batching design, first introduced in
https://github.com/neondatabase/neon/pull/9851
- refinement of https://github.com/neondatabase/neon/pull/8339

# Problem

`Tenant::shutdown` was occasionally taking many minutes (sometimes up to
20) in staging and prod if the
`page_service_pipelining.mode="concurrent-futures"` is enabled.

# Symptoms

The issue happens during shard migration between pageservers.
There is page_service unavailability and hence effectively downtime for
customers in the following case:
1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`,
waiting for the gate to close.
2. Cplane/Storcon decides to transition the target `AttachedMulti` to
`AttachedSingle`.
3. That transition comes with a bump of the generation number, causing
the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` /
`Tenant::attach` cycle for the target location.
4. That `Tenant::shutdown` on the target gets stuck, waiting for the
gate to close.
5. Eventually the gate closes (`close completed`), correlating with a
`page_service` connection handler logging that it's exiting because of a
network error (`Connection reset by peer` or `Broken pipe`).

While in (4):
- `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls
to complete.
  So, really, this is a `Timeline::shutdown` bug.
- retries from Cplane/Storcon to complete above state transitions, fail
with errors related to the tenant mgr slot being in state
`TenantSlot::InProgress`, the tenant state being
`TenantState::Stopping`, and the timelines being in
`TimelineState::Stopping`, and the `Timeline::cancel` being cancelled.
- Existing (and/or new?) page_service connections log errors `error
reading relation or page version: Not found: Timed out waiting 30s for
tenant active state. Latest state: None`

# Root-Cause

After a lengthy investigation ([internal
write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4))
I arrived at the following root cause.

The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the
Batcher and Executor stages of the pipelined mode was storing a `Handle`
and thus `GateGuard` of the Timeline that was not shutting down.
The design assumption with pipelining was that this would always be a
short transient state.
However, that was incorrect: the Executor was stuck on writing/flushing
an earlier response into the connection to the client, i.e., socket
write being slow because of TCP backpressure.

The probable scenario of how we end up in that case:
1. Compute backend process sends a continuous stream of getpage prefetch
requests into the connection, but never reads the responses (why this
happens: see Appendix section).
2. Batch N is processed by Batcher and Executor, up to the point where
Executor starts flushing the response.
3. Batch N+1 is procssed by Batcher and queued in the `spsc_fold`.
4. Executor is still waiting for batch N flush to finish.
5. Batcher eventually hits the `TimeoutReader` error (10min).
From here on it waits on the
`spsc_fold.send(Err(QueryError(TimeoutReader_error)))`
which will never finish because the batch already inside the `spsc_fold`
is not
being read by the Executor, because the Executor is still stuck in the
flush.
   (This state is not observable at our default `info` log level)
6. Eventually, Compute backend process is killed (`close()` on the
socket) or Compute as a whole gets killed (probably no clean TCP
shutdown happening in that case).
7. Eventually, Pageserver TCP stack learns about (6) through RST packets
and the Executor's flush() call fails with an error.
8. The Executor exits, dropping `cancel_batcher` and its end of the
spsc_fold.
   This wakes Batcher, causing the `spsc_fold.send` to fail.
   Batcher exits.
   The pipeline shuts down as intended.
We return from `process_query` and log the `Connection reset by peer` or
`Broken pipe` error.

The following diagram visualizes the wait-for graph at (5)

```mermaid
flowchart TD
   Batcher --spsc_fold.send(TimeoutReader_error)--> Executor
   Executor --flush batch N responses--> socket.write_end
   socket.write_end --wait for TCP window to move forward--> Compute
```

# Analysis

By holding the GateGuard inside the `spsc_fold` open, the pipelining
implementation
violated the principle established in
(https://github.com/neondatabase/neon/pull/8339).
That is, that `Handle`s must only be held across an await point if that
await point
is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token.

In this case, we were holding the Handle inside the `spsc_fold` while
awaiting the
`pgb_writer.flush()` future.

One may jump to the conclusion that we should simply peek into the
spsc_fold to get
that Timeline cancel token and be sensitive to it during flush, then.

But that violates another principle of the design from
https://github.com/neondatabase/neon/pull/8339.
That is, that the page_service connection lifecycle and the Timeline
lifecycles must be completely decoupled.
Tt must be possible to shut down one shard without shutting down the
page_service connection, because on that single connection we might be
serving other shards attached to this pageserver.
(The current compute client opens separate connections per shard, but,
there are plans to change that.)

# Solution

This PR adds a `handle::WeakHandle` struct that does _not_ hold the
timeline gate open.
It must be `upgrade()`d to get a `handle::Handle`.
That `handle::Handle` _does_ hold the timeline gate open.

The batch queued inside the `spsc_fold` only holds a `WeakHandle`.
We only upgrade it while calling into the various `handle_` methods,
i.e., while interacting with the `Timeline` via `<Handle as
Deref<Target=Timeline>>`.
All that code has always been required to be (and is!) sensitive to
`Timeline::cancel`, and therefore we're guaranteed to bail from it
quickly when `Timeline::shutdown` starts.
We will drop the `Handle` immediately, before we start
`pgb_writer.flush()`ing the responses.
Thereby letting go of our hold on the `GateGuard`, allowing the timeline
shutdown to complete while the page_service handler remains intact.

# Code Changes

* Reproducer & Regression Test
* Developed and proven to reproduce the issue in
https://github.com/neondatabase/neon/pull/10399
* Add a `Test` message to the pagestream protocol (`cfg(feature =
"testing")`).
* Drive-by minimal improvement to the parsing code, we now have a
`PagestreamFeMessageTag`.
* Refactor `pageserver/client` to allow sending and receiving
`page_service` requests independently.
  * Add a Rust helper binary to produce situation (4) from above
* Rationale: (4) and (5) are the same bug class, we're holding a gate
open while `flush()`ing.
* Add a Python regression test that uses the helper binary to
demonstrate the problem.
* Fix
   * Introduce and use `WeakHandle` as explained earlier.
* Replace the `shut_down` atomic with two enum states for `HandleInner`,
wrapped in a `Mutex`.
* To make `WeakHandle::upgrade()` and `Handle::downgrade()`
cache-efficient:
     * Wrap the `Types::Timeline` in an `Arc`
     * Wrap the `GateGuard` in an `Arc`
* The separate `Arc`s enable uncontended cloning of the timeline
reference in `upgrade()` and `downgrade()`.
If instead we were `Arc<Timeline>::clone`, different connection handlers
would be hitting the same cache line on every upgrade()/downgrade(),
causing contention.
* Please read the udpated module-level comment in `mod handle`
module-level comment for details.

# Testing & Performance

The reproducer test that failed before the changes now passes, and
obviously other tests are passing as well.

We'll do more testing in staging, where the issue happens every ~4h if
chaos migrations are enabled in storcon.

Existing perf testing will be sufficient, no perf degradation is
expected.
It's a few more alloctations due to the added Arc's, but, they're low
frequency.

# Appendix: Why Compute Sometimes Doesn't Read Responses

Remember, the whole problem surfaced because flush() was slow because
Compute was not reading responses. Why is that?

In short, the way the compute works, it only advances the page_service
protocol processing when it has an interest in data, i.e., when the
pagestore smgr is called to return pages.

Thus, if compute issues a bunch of requests as part of prefetch but then
it turns out it can service the query without reading those pages, it
may very well happen that these messages stay in the TCP until the next
smgr read happens, either in that session, or possibly in another
session.

If there’s too many unread responses in the TCP, the pageserver kernel
is going to backpressure into userspace, resulting in our stuck flush().

All of this stems from the way vanilla Postgres does prefetching and
"async IO":
it issues `fadvise()` to make the kernel do the IO in the background,
buffering results in the kernel page cache.
It then consumes the results through synchronous `read()` system calls,
which hopefully will be fast because of the `fadvise()`.

If it turns out that some / all of the prefetch results are not needed,
Postgres will not be issuing those `read()` system calls.
The kernel will eventually react to that by reusing page cache pages
that hold completed prefetched data.
Uncompleted prefetch requests may or may not be processed -- it's up to
the kernel.

In Neon, the smgr + Pageserver together take on the role of the kernel
in above paragraphs.
In the current implementation, all prefetches are sent as GetPage
requests to Pageserver.
The responses are only processed in the places where vanilla Postgres
would do the synchronous `read()` system call.
If we never get to that, the responses are queued inside the TCP
connection, which, once buffers run full, will backpressure into
Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the
root cause of the problems we're fixing in this PR.
2025-01-16 20:34:02 +00:00
..
2024-11-21 16:25:31 +00:00

Neon test runner

This directory contains integration tests.

Prerequisites:

  • Correctly configured Python, see /docs/sourcetree.md
  • Neon and Postgres binaries
    • See the root README.md for build directions To run tests you need to add --features testing to Rust code build commands. For convenience, repository cargo config contains build_testing alias, that serves as a subcommand, adding the required feature flags. Usage example: cargo build_testing --release is equivalent to cargo build --features testing --release
    • Tests can be run from the git tree; or see the environment variables below to run from other directories.
  • The neon git repo, including the postgres submodule (for some tests, e.g. pg_regress)

Test Organization

Regression tests are in the 'regress' directory. They can be run in parallel to minimize total runtime. Most regression test sets up their environment with its own pageservers and safekeepers.

'pg_clients' contains tests for connecting with various client libraries. Each client test uses a Dockerfile that pulls an image that contains the client, and connects to PostgreSQL with it. The client tests can be run against an existing PostgreSQL or Neon installation.

'performance' contains performance regression tests. Each test exercises a particular scenario or workload, and outputs measurements. They should be run serially, to avoid the tests interfering with the performance of each other. Some performance tests set up their own Neon environment, while others can be run against an existing PostgreSQL or Neon environment.

Running the tests

There is a wrapper script to invoke pytest: ./scripts/pytest. It accepts all the arguments that are accepted by pytest. Depending on your installation options pytest might be invoked directly.

Test state (postgres data, pageserver state, and log files) will be stored under a directory test_output.

You can run all the tests with:

./scripts/pytest

If you want to run all the tests in a particular file:

./scripts/pytest test_pgbench.py

If you want to run all tests that have the string "bench" in their names:

./scripts/pytest -k bench

To run tests in parellel we utilize pytest-xdist plugin. By default everything runs single threaded. Number of workers can be specified with -n argument:

./scripts/pytest -n4

By default performance tests are excluded. To run them explicitly pass performance tests selection to the script:

./scripts/pytest test_runner/performance

Useful environment variables:

NEON_BIN: The directory where neon binaries can be found. COMPATIBILITY_NEON_BIN: The directory where the previous version of Neon binaries can be found POSTGRES_DISTRIB_DIR: The directory where postgres distribution can be found. Since pageserver supports several postgres versions, POSTGRES_DISTRIB_DIR must contain a subdirectory for each version with naming convention v{PG_VERSION}/. Inside that dir, a bin/postgres binary should be present. COMPATIBILITY_POSTGRES_DISTRIB_DIR: The directory where the prevoius version of postgres distribution can be found. DEFAULT_PG_VERSION: The version of Postgres to use, This is used to construct full path to the postgres binaries. Format is 2-digit major version nubmer, i.e. DEFAULT_PG_VERSION=16 TEST_OUTPUT: Set the directory where test state and test output files should go. RUST_LOG: logging configuration to pass into Neon CLI

Useful parameters and commands:

--preserve-database-files to preserve pageserver (layer) and safekeer (segment) timeline files on disk after running a test suite. Such files might be large, so removed by default; but might be useful for debugging or creation of svg images with layer file contents. If NeonEnvBuilder#preserve_database_files set to True for a particular test, the whole repo directory will be attached to Allure report (thus uploaded to S3) as everything.tar.zst for this test.

Let stdout, stderr and INFO log messages go to the terminal instead of capturing them: ./scripts/pytest -s --log-cli-level=INFO ... (Note many tests capture subprocess outputs separately, so this may not show much.)

Exit after the first test failure: ./scripts/pytest -x ... (there are many more pytest options; run pytest -h to see them.)

Running Python tests against real S3 or S3-compatible services

Neon's libs/remote_storage supports multiple implementations of remote storage. At the time of writing, that is

pub enum RemoteStorageKind {
    /// Storage based on local file system.
    /// Specify a root folder to place all stored files into.
    LocalFs(Utf8PathBuf),
    /// AWS S3 based storage, storing all files in the S3 bucket
    /// specified by the config
    AwsS3(S3Config),
    /// Azure Blob based storage, storing all files in the container
    /// specified by the config
    AzureContainer(AzureConfig),
}

The test suite has a Python enum with equal name but different meaning:

@enum.unique
class RemoteStorageKind(StrEnum):
    LOCAL_FS = "local_fs"
    MOCK_S3 = "mock_s3"
    REAL_S3 = "real_s3"
  • LOCAL_FS => LocalFs
  • MOCK_S3: starts moto's S3 implementation, then configures Pageserver with AwsS3
  • REAL_S3 => configure AwsS3 as detailed below

When a test in the test suite needs an AwsS3, it is supposed to call remote_storage.s3_storage(). That function checks env var ENABLE_REAL_S3_REMOTE_STORAGE:

  • If it is not set, use MOCK_S3
  • If it is set, use REAL_S3.

For REAL_S3, the test suite creates the dict/toml representation of the RemoteStorageKind::AwsS3 based on env vars:

pub struct S3Config {
    // test suite env var: REMOTE_STORAGE_S3_BUCKET
    pub bucket_name: String,
    // test suite env var: REMOTE_STORAGE_S3_REGION
    pub bucket_region: String,
    // test suite determines this
    pub prefix_in_bucket: Option<String>,
    // no env var exists; test suite sets it for MOCK_S3, because that's how moto works
    pub endpoint: Option<String>,
    ...
}

Credentials are not part of the config, but discovered by the AWS SDK. See the libs/remote_storage Rust code. We're documenting two mechanism here:

The test suite supports two mechanisms (remote_storage.py):

Credential mechanism 1: env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Populate the env vars with AWS access keys that you created in IAM. Our CI uses this mechanism. However, it is not recommended for interactive use by developers (learn more). Instead, use profiles (next section).

Credential mechanism 2: env var AWS_PROFILE. This uses the AWS SDK's (and CLI's) profile mechanism. Learn more about it in the official docs. After configuring a profile (e.g. via the aws CLI), set the env var to its name.

In conclusion, the full command line is:

# with long-term AWS access keys
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_ACCESS_KEY_ID=... \
AWS_SECRET_ACCESS_KEY=... \
./scripts/pytest
# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_PROFILE=... \
./scripts/pytest

If you're using SSO, make sure to aws sso login --profile $AWS_PROFILE first.

Minio

If you want to run test without the cloud setup, we recommend minio.

# Start in Terminal 1
mkdir /tmp/minio_data
minio server /tmp/minio_data --console-address 127.0.0.1:9001 --address 127.0.0.1:9000

In another terminal, create an aws CLI profile for it:

# append to ~/.aws/config
[profile local-minio]
services = local-minio-services
[services local-minio-services]
s3 =
  endpoint_url=http://127.0.0.1:9000/

Now configure the credentials (this is going to write ~/.aws/credentials for you). It's an interactive prompt.

# Terminal 2
$ aws --profile local-minio configure
AWS Access Key ID [None]: minioadmin
AWS Secret Access Key [None]: minioadmin
Default region name [None]:
Default output format [None]:

Now create a bucket testbucket using the CLI.

# (don't forget to have AWS_PROFILE env var set; or use --profile)
aws --profile local-minio s3 mb s3://mybucket

(If it doesn't work, make sure you update your AWS CLI to a recent version. The service-specific endpoint feature that we're using is quite new.)

# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=doesntmatterforminio \
AWS_PROFILE=local-minio \
./scripts/pytest

NB: you can avoid the --profile by setting the AWS_PROFILE variable. Just like the AWS SDKs, the aws CLI is sensible to it.

Running Rust tests against real S3 or S3-compatible services

We have some Rust tests that only run against real S3, e.g., here.

They use the same env vars as the Python test suite (see previous section) but interpret them on their own. However, at this time, the interpretation is identical.

So, above instructions apply to the Rust test as well.

Writing a test

Every test needs a Neon Environment, or NeonEnv to operate in. A Neon Environment is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and compute Postgres nodes. The connections between them can be configured to use JWT authentication tokens, and some other configuration options can be tweaked too.

The easiest way to get access to a Neon Environment is by using the neon_simple_env fixture. For convenience, there is a branch called main in environments created with 'neon_simple_env', ready to be used in the test.

For more complicated cases, you can build a custom Neon Environment, with the neon_env fixture:

def test_foobar(neon_env_builder: NeonEnvBuilder):
    # Prescribe the environment.
    # We want to have 3 safekeeper nodes, and use JWT authentication in the
    # connections to the page server
    neon_env_builder.num_safekeepers = 3
    neon_env_builder.set_pageserver_auth(True)

    # Now create the environment. This initializes the repository, and starts
    # up the page server and the safekeepers
    env = neon_env_builder.init_start()

    # Run the test
    ...

The env includes a default tenant and timeline. Therefore, you do not need to create your own tenant/timeline for testing.

def test_foobar2(neon_env_builder: NeonEnvBuilder):
    env = neon_env_builder.init_start() # Start the environment
    with env.endpoints.create_start("main") as endpoint:
        # Start the compute endpoint
    client = env.pageserver.http_client() # Get the pageserver client

    tenant_id = env.initial_tenant
    timeline_id = env.initial_timeline
    client.timeline_detail(tenant_id=tenant_id, timeline_id=timeline_id)

All the test which rely on NeonEnvBuilder, can check the various version combinations of the components. To do this yuo may want to add the parametrize decorator with the function fixtures.utils.allpairs_versions() E.g.

@pytest.mark.parametrize(**fixtures.utils.allpairs_versions())
def test_something(
...

For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html

At the end of a test, all the nodes in the environment are automatically stopped, so you don't need to worry about cleaning up. Logs and test data are preserved for the analysis, in a directory under ../test_output/<testname>

Before submitting a patch

Ensure that you pass all obligatory checks.

Also consider:

  • Writing a couple of docstrings to clarify the reasoning behind a new test.
  • Adding more type hints to your code to avoid Any, especially:
    • For fixture parameters, they are not automatically deduced.
    • For function arguments and return values.