# Problem

The Pageserver read path exclusively uses direct IO if `virtual_file_io_mode=direct`.

The write path is half-finished. Here is what the various writing components use:

|what|buffering|flags on <br/>`virtual_file_io_mode`<br/>=`buffered`|flags on <br/>`virtual_file_io_mode`<br/>=`direct`|
|-|-|-|-|
|`DeltaLayerWriter`| `BlobWriter<BUFFERED=true>` | () | () |
|`ImageLayerWriter`| `BlobWriter<BUFFERED=false>` | () | () |
|`download_layer_file`|BufferedWriter|()|()|
|`InMemoryLayer`|BufferedWriter|()|O_DIRECT|

The vehicle towards direct IO support is `BufferedWriter`, which provides
- most of the handling of O_DIRECT alignment & size-multiple requirements
- double-buffering to mask latency

`DeltaLayerWriter` and `ImageLayerWriter` use `blob_io::BlobWriter`, which has neither of these.

# Changes

## High-Level

At a high level, this PR makes the following primary changes:

- switch the two layer writer types to use `BufferedWriter` & make them sensitive to `virtual_file_io_mode` (via open_with_options_**v2**)
- make `download_layer_file` sensitive to `virtual_file_io_mode` (also via open_with_options_**v2**)
- add `virtual_file_io_mode=direct-rw` as a feature gate
  - we're hackish-ly piggybacking on OpenOptions's ask for write access here
  - this means with just `=direct`, InMemoryLayer reads and writes no longer use O_DIRECT
  - this is transitory and we'll remove the `direct-rw` variant once the rollout is complete

(The `_v2` APIs for opening / creating VirtualFile are those that are sensitive to `virtual_file_io_mode`)

The result is:

|what|uses <br/>`BufferedWriter`|flags on <br/>`virtual_file_io_mode`<br/>=`buffered`|flags on <br/>`virtual_file_io_mode`<br/>=`direct`|flags on <br/>`virtual_file_io_mode`<br/>=`direct-rw`|
|-|-|-|-|-|
|`DeltaLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`ImageLayerWriter`| ~~Blob~~BufferedWriter | () | () | O_DIRECT |
|`download_layer_file`|BufferedWriter|()|()|O_DIRECT|
|`InMemoryLayer`|BufferedWriter|()|~~O_DIRECT~~()|O_DIRECT|

## Code-Level

The main change is:
- Switch `blob_io::BlobWriter` away from its own buffering method to use `BufferedWriter`.

Additional prep for upholding `O_DIRECT` requirements:
- Layer writer `finish()` methods switched to use IoBufferMut for guaranteed buffer address alignment. The size of the buffers is PAGE_SZ and thereby implicitly assumed to fulfill O_DIRECT requirements.

For the hacky feature-gating via `=direct-rw`:
- Track `OpenOptions::write(true|false)` in a field; bunch of mechanical churn.
- Consolidate the APIs in which we "open" or "create" VirtualFile for a better overview of which parts of the code use the `_v2` APIs.

Necessary refactorings & infra work:
- Add doc comments explaining how BufferedWriter ensures that writes are compliant with O_DIRECT alignment & size constraints. This isn't new, but should be spelled out.
- Add the concept of shutdown modes to `BufferedWriter::shutdown` to make writer shutdown adhere to these constraints.
  - The `PadThenTruncate` mode might not be necessary in practice because I believe all layer files ever written are sized in multiples of `PAGE_SZ`, and since `PAGE_SZ` is larger than the current alignment requirements (512/4k depending on platform), it won't be necessary to pad.
  - Some test (I believe `round_trip_test_compressed`?) required it though
  - [ ] TODO: decide if we want to accept that complexity; if we do then address TODO in the code to separate alignment requirement from buffer capacity
- Add `set_len` (=`ftruncate`) VirtualFile operation to support the above.
- Allow `BufferedWriter` to start at a non-zero offset (to make room for the summary block).

Cleanups unlocked by this change:
- Remove non-positional APIs from VirtualFile (e.g. seek, write_full, read_full)

Drive-by fixes:
- PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit tests for all `virtual_file_io_mode` combinations but didn't because of a missing `_` in the env var.
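As an illustration of the `PadThenTruncate` shutdown mode and the `set_len` (=`ftruncate`) operation mentioned above, here is a minimal Python sketch of the pad-then-truncate idea. It is not the Pageserver's Rust implementation: the names `ALIGN`, `padded_len`, `pad_then_truncate` and `demo` are hypothetical, the alignment of 512 is an assumption taken from the "512/4k" remark, and the file is opened without `O_DIRECT` so the example stays portable.

```python
import os
import tempfile

# Assumed alignment; the text mentions 512 or 4k depending on platform.
ALIGN = 512


def padded_len(n: int, align: int = ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def pad_then_truncate(path: str, payload: bytes, align: int = ALIGN) -> int:
    """Write payload zero-padded to a multiple of align, then cut the file
    back to the payload's real length. Returns the number of bytes written."""
    padded = payload.ljust(padded_len(len(payload), align), b"\0")
    # Note: no O_DIRECT here; this only demonstrates the size arithmetic.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(padded):
            written += os.write(fd, padded[written:])
        # Equivalent of the set_len / ftruncate step: drop the padding.
        os.ftruncate(fd, len(payload))
    finally:
        os.close(fd)
    return written


def demo() -> tuple[int, int]:
    """Write 1000 bytes and report (bytes written, final file size)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "layer")
        written = pad_then_truncate(path, b"x" * 1000)
        return written, os.path.getsize(path)
```

With a payload that is already a multiple of the alignment, `padded_len` returns the input unchanged and the truncate step is a no-op, which matches the observation that padding may be unnecessary for files sized in multiples of `PAGE_SZ`.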
# Performance

This section assesses this PR's impact on deployments with the current production setting (`=direct`) and the anticipated impact of switching to `=direct-rw`.

For `DeltaLayerWriter`, throughput under `=direct` should be unchanged to slightly improved, because the `BlobWriter`'s buffer had the same size as the `BufferedWriter`'s buffer but lacked the double-buffering that `BufferedWriter` has. `=direct-rw` enables direct IO; throughput should not suffer, thanks to double-buffering; benchmarks will show if this is true.

The `ImageLayerWriter` was previously not doing any buffering (`BUFFERED=false`). It went straight to issuing the IO operation to the underlying VirtualFile, and the buffering was done by the kernel. The switch to `BufferedWriter` under `=direct` adds an additional memcpy into the BufferedWriter's buffer. We will win back that memcpy when enabling direct IO via `=direct-rw`.

A nice win from the switch to `BufferedWriter` is that ImageLayerWriter performs >=16x fewer write operations to VirtualFile (the BlobWriter performs one write per len field and one write per image value). This should save low tens of microseconds of CPU overhead from doing all these syscalls/io_uring operations, regardless of `=direct` or `=direct-rw`. Aside from problems with alignment, this write frequency without double-buffering is prohibitive if we actually have to wait for the disk, which is what will happen when we enable direct IO via `=direct-rw`. Throughput should not suffer, thanks to BufferedWriter's double-buffering; benchmarks will show if this is true.

`InMemoryLayer` at `=direct` will flip back to using buffered IO but remain on BufferedWriter. The buffered IO adds back one memcpy of CPU overhead. Throughput should not suffer and might improve on Pageservers that are not under memory pressure, but let's remember that we're doing the whole direct IO thing to eliminate global memory pressure as a source of perf variability.
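The feature gate that this performance discussion relies on can be summarized in a few lines. The following Python sketch is only a model of the behaviour described in this PR text for files opened through the `_v2` APIs; the function name `uses_o_direct` is hypothetical, and the real logic lives in the Rust `VirtualFile` code.

```python
def uses_o_direct(io_mode: str, write_access: bool) -> bool:
    """Model of the gate as described in the PR text (an assumption, not the
    actual Rust code): whether a file opened via the _v2 APIs gets O_DIRECT."""
    if io_mode == "buffered":
        # Buffered mode never uses O_DIRECT.
        return False
    if io_mode == "direct":
        # Piggybacks on the open options' ask for write access:
        # read-only opens use O_DIRECT, writable opens do not.
        return not write_access
    if io_mode == "direct-rw":
        # Direct IO for both the read path and the write path.
        return True
    raise ValueError(f"unknown virtual_file_io_mode: {io_mode!r}")
```

Under this model, a writable file such as the one behind `InMemoryLayer` loses O_DIRECT at `=direct` and regains it at `=direct-rw`, which is the flip described in the last paragraph above.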
## bench_ingest

I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`.
Use `git diff` with `--word-diff` or similar to see the change.

General guidance on interpretation:
- immediate production impact of this PR without production config change can be gauged by comparing the same `io_mode=Direct`
- end state of production switched over to `io_mode=DirectRw` can be gauged by comparing old results' `io_mode=Direct` to new results' `io_mode=DirectRw`

Given above guidance, on `im4gn.2xlarge`
- immediate impact is a significant improvement in all cases
- end state after switching has same significant improvements in all cases
  - ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192 key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s` instead of `253.43 MiB/s`
    - this is a 6% degradation
    - this workload is typical for image layer creation

# Refs

- epic https://github.com/neondatabase/neon/issues/9868
- stacked atop
  - preliminary refactor https://github.com/neondatabase/neon/pull/11549
  - bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667
- derived from https://github.com/neondatabase/neon/pull/10063

Co-authored-by: Yuchen Liang <yuchen@neon.tech>
Neon test runner
This directory contains integration tests.
Prerequisites:
- Correctly configured Python, see /docs/sourcetree.md
- Neon and Postgres binaries
  - See the root README.md for build directions.
    To run tests you need to add --features testing to Rust code build commands. For convenience, the repository cargo config contains a build_testing alias that serves as a subcommand, adding the required feature flags. Usage example: cargo build_testing --release is equivalent to cargo build --features testing --release
  - Tests can be run from the git tree; or see the environment variables below to run from other directories.
- The neon git repo, including the postgres submodule (for some tests, e.g. pg_regress)
Test Organization
Regression tests are in the 'regress' directory. They can be run in parallel to minimize total runtime. Most regression tests set up their own environment with their own pageservers and safekeepers.
'pg_clients' contains tests for connecting with various client libraries. Each client test uses a Dockerfile that pulls an image that contains the client, and connects to PostgreSQL with it. The client tests can be run against an existing PostgreSQL or Neon installation.
'performance' contains performance regression tests. Each test exercises a particular scenario or workload, and outputs measurements. They should be run serially, to avoid the tests interfering with the performance of each other. Some performance tests set up their own Neon environment, while others can be run against an existing PostgreSQL or Neon environment.
Running the tests
There is a wrapper script to invoke pytest: ./scripts/pytest.
It accepts all the arguments that are accepted by pytest.
Depending on your installation options pytest might be invoked directly.
Test state (postgres data, pageserver state, and log files) will
be stored under a directory test_output.
You can run all the tests with:
./scripts/pytest
If you want to run all the tests in a particular file:
./scripts/pytest test_pgbench.py
If you want to run all tests that have the string "bench" in their names:
./scripts/pytest -k bench
To run tests in parallel we use the pytest-xdist plugin. By default everything runs single-threaded. The number of workers can be specified with the -n argument:
./scripts/pytest -n4
By default performance tests are excluded. To run them explicitly pass performance tests selection to the script:
./scripts/pytest test_runner/performance
Useful environment variables:
NEON_BIN: The directory where neon binaries can be found.
COMPATIBILITY_NEON_BIN: The directory where the previous version of Neon binaries can be found
POSTGRES_DISTRIB_DIR: The directory where postgres distribution can be found.
Since pageserver supports several postgres versions, POSTGRES_DISTRIB_DIR must contain
a subdirectory for each version with naming convention v{PG_VERSION}/.
Inside that dir, a bin/postgres binary should be present.
COMPATIBILITY_POSTGRES_DISTRIB_DIR: The directory where the previous version of postgres distribution can be found.
DEFAULT_PG_VERSION: The version of Postgres to use.
This is used to construct the full path to the postgres binaries.
The format is the 2-digit major version number, e.g. DEFAULT_PG_VERSION=17
TEST_OUTPUT: Set the directory where test state and test output files
should go.
RUST_LOG: logging configuration to pass into Neon CLI
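To make the directory convention for POSTGRES_DISTRIB_DIR and DEFAULT_PG_VERSION concrete, here is a small Python sketch that builds the expected path of the postgres binary from these two variables. The helper name `postgres_binary` is hypothetical and this is not the test suite's own code; it only follows the v{PG_VERSION}/bin/postgres layout described above.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional


def postgres_binary(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return <POSTGRES_DISTRIB_DIR>/v<DEFAULT_PG_VERSION>/bin/postgres.

    Hypothetical helper: raises KeyError if either variable is missing.
    """
    env = os.environ if env is None else env
    distrib_dir = Path(env["POSTGRES_DISTRIB_DIR"])
    pg_version = env["DEFAULT_PG_VERSION"]  # 2-digit major version, e.g. "17"
    return distrib_dir / f"v{pg_version}" / "bin" / "postgres"
```

Passing an explicit mapping instead of relying on the process environment makes the helper easy to check without touching real environment variables.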
Useful parameters and commands:
--preserve-database-files to preserve pageserver (layer) and safekeeper (segment) timeline files on disk
after running a test suite. Such files might be large, so they are removed by default, but they can be useful for debugging or for creating svg images of layer file contents. If NeonEnvBuilder#preserve_database_files is set to True for a particular test, the whole repo directory will be attached to the Allure report (thus uploaded to S3) as everything.tar.zst for this test.
Let stdout, stderr and INFO log messages go to the terminal instead of capturing them:
./scripts/pytest -s --log-cli-level=INFO ...
(Note many tests capture subprocess outputs separately, so this may not
show much.)
Exit after the first test failure:
./scripts/pytest -x ...
(there are many more pytest options; run pytest -h to see them.)
Running Python tests against real S3 or S3-compatible services
Neon's libs/remote_storage supports multiple implementations of remote storage.
At the time of writing, that is
pub enum RemoteStorageKind {
    /// Storage based on local file system.
    /// Specify a root folder to place all stored files into.
    LocalFs(Utf8PathBuf),
    /// AWS S3 based storage, storing all files in the S3 bucket
    /// specified by the config
    AwsS3(S3Config),
    /// Azure Blob based storage, storing all files in the container
    /// specified by the config
    AzureContainer(AzureConfig),
}
The test suite has a Python enum with equal name but different meaning:
@enum.unique
class RemoteStorageKind(StrEnum):
    LOCAL_FS = "local_fs"
    MOCK_S3 = "mock_s3"
    REAL_S3 = "real_s3"
- LOCAL_FS => LocalFs
- MOCK_S3: starts moto's S3 implementation, then configures Pageserver with AwsS3
- REAL_S3 => configure AwsS3 as detailed below
When a test in the test suite needs an AwsS3, it is supposed to call remote_storage.s3_storage().
That function checks env var ENABLE_REAL_S3_REMOTE_STORAGE:
- If it is not set, use MOCK_S3
- If it is set, use REAL_S3.
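The selection rule above can be sketched in a few lines of Python. This is an illustration only, not the code in remote_storage.py: the enum `S3Kind` and the function `choose_s3_kind` are hypothetical stand-ins, and the sketch assumes that "set" means the variable is present in the environment, whatever its value.

```python
import enum
import os
from collections.abc import Mapping
from typing import Optional


class S3Kind(enum.Enum):
    """Hypothetical stand-in for the two S3-flavoured kinds of the test suite."""

    MOCK_S3 = "mock_s3"
    REAL_S3 = "real_s3"


def choose_s3_kind(env: Optional[Mapping[str, str]] = None) -> S3Kind:
    """Pick REAL_S3 if ENABLE_REAL_S3_REMOTE_STORAGE is present, else MOCK_S3."""
    env = os.environ if env is None else env
    if "ENABLE_REAL_S3_REMOTE_STORAGE" in env:
        return S3Kind.REAL_S3
    return S3Kind.MOCK_S3
```

Defaulting to the mock keeps a plain ./scripts/pytest run independent of any cloud setup; real S3 is strictly opt-in.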
For REAL_S3, the test suite creates the dict/toml representation of the RemoteStorageKind::AwsS3 based on env vars:
pub struct S3Config {
    // test suite env var: REMOTE_STORAGE_S3_BUCKET
    pub bucket_name: String,
    // test suite env var: REMOTE_STORAGE_S3_REGION
    pub bucket_region: String,
    // test suite determines this
    pub prefix_in_bucket: Option<String>,
    // no env var exists; test suite sets it for MOCK_S3, because that's how moto works
    pub endpoint: Option<String>,
    ...
}
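As a rough illustration of how the env vars map onto the S3Config fields shown above, the following Python sketch builds a dict for the REAL_S3 case. The function name `real_s3_config` and the exact shape of the dict are assumptions made for this example; the authoritative logic is in the test suite's remote_storage.py. The endpoint field is left out because, per the comments above, no env var exists for it and it is only set for MOCK_S3.

```python
from collections.abc import Mapping


def real_s3_config(env: Mapping[str, str], prefix_in_bucket: str) -> dict:
    """Build a dict mirroring the S3Config fields (hypothetical helper).

    bucket_name and bucket_region come from env vars; prefix_in_bucket is
    chosen by the caller, standing in for "test suite determines this".
    """
    required = ("REMOTE_STORAGE_S3_BUCKET", "REMOTE_STORAGE_S3_REGION")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing env vars for REAL_S3: {', '.join(missing)}")
    return {
        "bucket_name": env["REMOTE_STORAGE_S3_BUCKET"],
        "bucket_region": env["REMOTE_STORAGE_S3_REGION"],
        "prefix_in_bucket": prefix_in_bucket,
    }
```

Failing early with the names of the missing variables is a design choice of this sketch; it makes a misconfigured REAL_S3 run easier to diagnose than a later error from the storage client.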
Credentials are not part of the config, but discovered by the AWS SDK.
See the libs/remote_storage Rust code.
The test suite supports two credential mechanisms (remote_storage.py):
Credential mechanism 1: env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
Populate the env vars with AWS access keys that you created in IAM.
Our CI uses this mechanism.
However, it is not recommended for interactive use by developers.
Instead, use profiles (next section).
Credential mechanism 2: env var AWS_PROFILE.
This uses the AWS SDK's (and CLI's) profile mechanism.
Learn more about it in the official docs.
After configuring a profile (e.g. via the aws CLI), set the env var to its name.
In conclusion, the full command line is:
# with long-term AWS access keys
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_ACCESS_KEY_ID=... \
AWS_SECRET_ACCESS_KEY=... \
./scripts/pytest
# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=eu-central-1 \
AWS_PROFILE=... \
./scripts/pytest
If you're using SSO, make sure to aws sso login --profile $AWS_PROFILE first.
Minio
If you want to run tests without the cloud setup, we recommend minio.
# Start in Terminal 1
mkdir /tmp/minio_data
minio server /tmp/minio_data --console-address 127.0.0.1:9001 --address 127.0.0.1:9000
In another terminal, create an aws CLI profile for it:
# append to ~/.aws/config
[profile local-minio]
services = local-minio-services
[services local-minio-services]
s3 =
  endpoint_url=http://127.0.0.1:9000/
Now configure the credentials (this is going to write ~/.aws/credentials for you).
It's an interactive prompt.
# Terminal 2
$ aws --profile local-minio configure
AWS Access Key ID [None]: minioadmin
AWS Secret Access Key [None]: minioadmin
Default region name [None]:
Default output format [None]:
Now create a bucket mybucket using the CLI.
# (don't forget to have AWS_PROFILE env var set; or use --profile)
aws --profile local-minio s3 mb s3://mybucket
(If it doesn't work, make sure you update your AWS CLI to a recent version. The service-specific endpoint feature that we're using is quite new.)
# with AWS PROFILE
ENABLE_REAL_S3_REMOTE_STORAGE=true \
REMOTE_STORAGE_S3_BUCKET=mybucket \
REMOTE_STORAGE_S3_REGION=doesntmatterforminio \
AWS_PROFILE=local-minio \
./scripts/pytest
NB: you can avoid the --profile by setting the AWS_PROFILE variable.
Just like the AWS SDKs, the aws CLI is sensitive to it.
Running Rust tests against real S3 or S3-compatible services
We have some Rust tests that only run against real S3, e.g., here.
They use the same env vars as the Python test suite (see previous section) but interpret them on their own. However, at this time, the interpretation is identical.
So the instructions above apply to the Rust tests as well.
Writing a test
Every test needs a Neon Environment, or NeonEnv to operate in. A Neon Environment is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and compute Postgres nodes. The connections between them can be configured to use JWT authentication tokens, and some other configuration options can be tweaked too.
The easiest way to get access to a Neon Environment is by using the neon_simple_env
fixture. For convenience, there is a branch called main in environments created with
'neon_simple_env', ready to be used in the test.
For more complicated cases, you can build a custom Neon Environment, with the neon_env_builder
fixture:
from fixtures.neon_fixtures import NeonEnvBuilder

def test_foobar(neon_env_builder: NeonEnvBuilder):
    # Prescribe the environment.
    # We want to have 3 safekeeper nodes, and use JWT authentication in the
    # connections to the page server
    neon_env_builder.num_safekeepers = 3
    neon_env_builder.set_pageserver_auth(True)
    # Now create the environment. This initializes the repository, and starts
    # up the page server and the safekeepers
    env = neon_env_builder.init_start()
    # Run the test
    ...
The env includes a default tenant and timeline. Therefore, you do not need to create your own tenant/timeline for testing.
def test_foobar2(neon_env_builder: NeonEnvBuilder):
    env = neon_env_builder.init_start()  # Start the environment
    with env.endpoints.create_start("main") as endpoint:
        # Start the compute endpoint
        client = env.pageserver.http_client()  # Get the pageserver client
        tenant_id = env.initial_tenant
        timeline_id = env.initial_timeline
        client.timeline_detail(tenant_id=tenant_id, timeline_id=timeline_id)
All tests that rely on NeonEnvBuilder can check the various version combinations of the components. To do this, you may want to add the parametrize decorator with the function fixtures.utils.allpairs_versions(), e.g.:
@pytest.mark.parametrize(**fixtures.utils.allpairs_versions())
def test_something(
    ...
For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html
At the end of a test, all the nodes in the environment are automatically stopped, so you
don't need to worry about cleaning up. Logs and test data are preserved for the analysis,
in a directory under ../test_output/<testname>
Before submitting a patch
Ensure that you pass all obligatory checks.
Also consider:
- Writing a couple of docstrings to clarify the reasoning behind a new test.
- Adding more type hints to your code to avoid Any, especially:
  - For fixture parameters, they are not automatically deduced.
  - For function arguments and return values.