Files
neon/test_runner
Christian Schwarz 9627747d35 bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537)
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offset, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` figures out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for thes values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io` 
5. We can remove the `ephemeral_file::page_caching` construct now.

The `get_values_reconstruct_data` changes seem like a bit overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code it relies on
the invariant of Delta&Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All pref testing done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see definitive improvement in ingest performance microbenchmark and
an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.

</details>

# Rollout

This PR changes the code in-place and  is not gated by a feature flag.
2024-08-28 18:31:41 +00:00
..

Neon test runner

This directory contains integration tests.

Prerequisites:

  • Correctly configured Python, see /docs/sourcetree.md
  • Neon and Postgres binaries
    • See the root README.md for build directions If you want to test tests with test-only APIs, you would need to add --features testing to Rust code build commands. For convenience, repository cargo config contains build_testing alias, that serves as a subcommand, adding the required feature flags. Usage example: cargo build_testing --release is equivalent to cargo build --features testing --release
    • Tests can be run from the git tree; or see the environment variables below to run from other directories.
  • The neon git repo, including the postgres submodule (for some tests, e.g. pg_regress)

Test Organization

Regression tests are in the 'regress' directory. They can be run in parallel to minimize total runtime. Most regression test sets up their environment with its own pageservers and safekeepers (but see TEST_SHARED_FIXTURES).

'pg_clients' contains tests for connecting with various client libraries. Each client test uses a Dockerfile that pulls an image that contains the client, and connects to PostgreSQL with it. The client tests can be run against an existing PostgreSQL or Neon installation.

'performance' contains performance regression tests. Each test exercises a particular scenario or workload, and outputs measurements. They should be run serially, to avoid the tests interfering with the performance of each other. Some performance tests set up their own Neon environment, while others can be run against an existing PostgreSQL or Neon environment.

Running the tests

There is a wrapper script to invoke pytest: ./scripts/pytest. It accepts all the arguments that are accepted by pytest. Depending on your installation options pytest might be invoked directly.

Test state (postgres data, pageserver state, and log files) will be stored under a directory test_output.

You can run all the tests with:

./scripts/pytest

If you want to run all the tests in a particular file:

./scripts/pytest test_pgbench.py

If you want to run all tests that have the string "bench" in their names:

./scripts/pytest -k bench

To run tests in parellel we utilize pytest-xdist plugin. By default everything runs single threaded. Number of workers can be specified with -n argument:

./scripts/pytest -n4

By default performance tests are excluded. To run them explicitly pass performance tests selection to the script:

./scripts/pytest test_runner/performance

Useful environment variables:

NEON_BIN: The directory where neon binaries can be found. POSTGRES_DISTRIB_DIR: The directory where postgres distribution can be found. Since pageserver supports several postgres versions, POSTGRES_DISTRIB_DIR must contain a subdirectory for each version with naming convention v{PG_VERSION}/. Inside that dir, a bin/postgres binary should be present. DEFAULT_PG_VERSION: The version of Postgres to use, This is used to construct full path to the postgres binaries. Format is 2-digit major version nubmer, i.e. DEFAULT_PG_VERSION=16 TEST_OUTPUT: Set the directory where test state and test output files should go. TEST_SHARED_FIXTURES: Try to re-use a single pageserver for all the tests. RUST_LOG: logging configuration to pass into Neon CLI

Useful parameters and commands:

--preserve-database-files to preserve pageserver (layer) and safekeer (segment) timeline files on disk after running a test suite. Such files might be large, so removed by default; but might be useful for debugging or creation of svg images with layer file contents. If NeonEnvBuilder#preserve_database_files set to True for a particular test, the whole repo directory will be attached to Allure report (thus uploaded to S3) as everything.tar.zst for this test.

Let stdout, stderr and INFO log messages go to the terminal instead of capturing them: ./scripts/pytest -s --log-cli-level=INFO ... (Note many tests capture subprocess outputs separately, so this may not show much.)

Exit after the first test failure: ./scripts/pytest -x ... (there are many more pytest options; run pytest -h to see them.)

Running Python tests against real S3 or S3-compatible services

Neon's libs/remote_storage supports multiple implementations of remote storage. At the time of writing, that is

pub enum RemoteStorageKind {
    /// Storage based on local file system.
    /// Specify a root folder to place all stored files into.
    LocalFs(Utf8PathBuf),
    /// AWS S3 based storage, storing all files in the S3 bucket
    /// specified by the config
    AwsS3(S3Config),
    /// Azure Blob based storage, storing all files in the container
    /// specified by the config
    AzureContainer(AzureConfig),
}

The test suite has a Python enum with equal name but different meaning:

@enum.unique
class RemoteStorageKind(str, enum.Enum):
    LOCAL_FS = "local_fs"
    MOCK_S3 = "mock_s3"
    REAL_S3 = "real_s3"
  • LOCAL_FS => LocalFs
  • MOCK_S3: starts moto's S3 implementation, then configures Pageserver with AwsS3
  • REAL_S3 => configure AwsS3 as detailed below

When a test in the test suite needs an AwsS3, it is supposed to call remote_storage.s3_storage(). That function checks env var ENABLE_REAL_S3_REMOTE_STORAGE:

  • If it is not set, use MOCK_S3
  • If it is set, use REAL_S3.

For REAL_S3, the test suite creates the dict/toml representation of the RemoteStorageKind::AwsS3 based on env vars:

pub struct S3Config {
    // test suite env var: REMOTE_STORAGE_S3_BUCKET
    pub bucket_name: String,
    // test suite env var: REMOTE_STORAGE_S3_REGION
    pub bucket_region: String,
    // test suite determines this
    pub prefix_in_bucket: Option<String>,
    // no env var exists; test suite sets it for MOCK_S3, because that's how moto works
    pub endpoint: Option<String>,
    ...
}

Credentials are not part of the config, but discovered by the AWS SDK. See the libs/remote_storage Rust code. We're documenting two mechanism here:

The test suite supports two mechanisms (remote_storage.py):

Credential mechanism 1: env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Populate the env vars with AWS access keys that you created in IAM. Our CI uses this mechanism. However, it is not recommended for interactive use by developers (learn more). Instead, use profiles (next section).

Credential mechanism 2: env var AWS_PROFILE. This uses the AWS SDK's (and CLI's) profile mechanism. Learn more about it in the official docs. After configuring a profile (e.g. via the aws CLI), set the env var to its name.

In conclusion, the full command line is:

# with long-term AWS access keys
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_ACCESS_KEY_ID=... \
AWS_SECRET_ACCESS_KEY=... \
./scripts/pytest
# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_PROFILE=... \
./scripts/pytest

If you're using SSO, make sure to aws sso login --profile $AWS_PROFILE first.

Minio

If you want to run test without the cloud setup, we recommend minio.

# Start in Terminal 1
mkdir /tmp/minio_data
minio server /tmp/minio_data --console-address 127.0.0.1:9001 --address 127.0.0.1:9000

In another terminal, create an aws CLI profile for it:

# append to ~/.aws/config
[profile local-minio]
services = local-minio-services
[services local-minio-services]
s3 =
  endpoint_url=http://127.0.0.1:9000/

Now configure the credentials (this is going to write ~/.aws/credentials for you). It's an interactive prompt.

# Terminal 2
$ aws --profile local-minio configure
AWS Access Key ID [None]: minioadmin
AWS Secret Access Key [None]: minioadmin
Default region name [None]:
Default output format [None]:

Now create a bucket testbucket using the CLI.

# (don't forget to have AWS_PROFILE env var set; or use --profile)
aws --profile local-minio s3 mb s3://mybucket

(If it doesn't work, make sure you update your AWS CLI to a recent version. The service-specific endpoint feature that we're using is quite new.)

# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=doesntmatterforminio \
AWS_PROFILE=local-minio \
./scripts/pytest

NB: you can avoid the --profile by setting the AWS_PROFILE variable. Just like the AWS SDKs, the aws CLI is sensible to it.

Running Rust tests against real S3 or S3-compatible services

We have some Rust tests that only run against real S3, e.g., here.

They use the same env vars as the Python test suite (see previous section) but interpret them on their own. However, at this time, the interpretation is identical.

So, above instructions apply to the Rust test as well.

Writing a test

Every test needs a Neon Environment, or NeonEnv to operate in. A Neon Environment is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and compute Postgres nodes. The connections between them can be configured to use JWT authentication tokens, and some other configuration options can be tweaked too.

The easiest way to get access to a Neon Environment is by using the neon_simple_env fixture. The 'simple' env may be shared across multiple tests, so don't shut down the nodes or make other destructive changes in that environment. Also don't assume that there are no tenants or branches or data in the cluster. For convenience, there is a branch called empty, though. The convention is to create a test-specific branch of that and load any test data there, instead of the 'main' branch.

For more complicated cases, you can build a custom Neon Environment, with the neon_env fixture:

def test_foobar(neon_env_builder: NeonEnvBuilder):
    # Prescribe the environment.
    # We want to have 3 safekeeper nodes, and use JWT authentication in the
    # connections to the page server
    neon_env_builder.num_safekeepers = 3
    neon_env_builder.set_pageserver_auth(True)

    # Now create the environment. This initializes the repository, and starts
    # up the page server and the safekeepers
    env = neon_env_builder.init_start()

    # Run the test
    ...

The env includes a default tenant and timeline. Therefore, you do not need to create your own tenant/timeline for testing.

def test_foobar2(neon_env_builder: NeonEnvBuilder):
    env = neon_env_builder.init_start() # Start the environment
    with env.endpoints.create_start("main") as endpoint:
        # Start the compute endpoint
    client = env.pageserver.http_client() # Get the pageserver client

    tenant_id = env.initial_tenant
    timeline_id = env.initial_timeline
    client.timeline_detail(tenant_id=tenant_id, timeline_id=timeline_id)

For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html

At the end of a test, all the nodes in the environment are automatically stopped, so you don't need to worry about cleaning up. Logs and test data are preserved for the analysis, in a directory under ../test_output/<testname>

Before submitting a patch

Ensure that you pass all obligatory checks.

Also consider:

  • Writing a couple of docstrings to clarify the reasoning behind a new test.
  • Adding more type hints to your code to avoid Any, especially:
    • For fixture parameters, they are not automatically deduced.
    • For function arguments and return values.