rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-09 22:42:57 +00:00

Christian Schwarz 9627747d35 bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537 )

Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy (and code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offsets, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.
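
A minimal sketch of the early-exit idea, assuming a simplified per-key list of versions sorted by ascending LSN. The `Entry` type and `entries_to_read` function are hypothetical names for illustration, not the pageserver's actual types:

```rust
/// Hypothetical record describing one stored version of a key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Entry {
    lsn: u64,
    /// True if this version fully initializes the page, so that no
    /// older version is needed to reconstruct it.
    will_init: bool,
}

/// Walk the versions of one key from newest to oldest, starting at
/// `request_lsn`, and stop at the first version that initializes the page.
/// `entries` is assumed to be sorted by ascending LSN.
fn entries_to_read(entries: &[Entry], request_lsn: u64) -> Vec<Entry> {
    let mut needed = Vec::new();
    for entry in entries.iter().rev().filter(|e| e.lsn <= request_lsn) {
        needed.push(*entry);
        if entry.will_init {
            break; // early exit: older versions are not needed
        }
    }
    needed
}
```

If `will_init` is only available inside the serialized `Value`, each step of this loop costs a read and a deserialization; that is the overhead described above.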

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` to figure out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for the values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, instead of `blob_io`.
5. We can remove the `ephemeral_file::page_caching` construct now.
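
The chunk planning in step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ValueRead`, `PhysicalRead`, and `plan_reads` are hypothetical names, and the constants mirror the 512-byte chunk size and 128k maximum read size mentioned above.

```rust
use std::collections::BTreeSet;

const DIO_CHUNK_SIZE: u64 = 512; // assumed chunk size, per the description above
const MAX_READ_SIZE: u64 = 128 * 1024; // assumed upper bound for one merged read

/// Hypothetical: one value to fetch, occupying bytes [pos, pos + len) of the file.
struct ValueRead {
    pos: u64,
    len: u64,
}

/// Hypothetical: one physical read, expressed in units of DIO chunks.
#[derive(Debug, PartialEq, Eq)]
struct PhysicalRead {
    start_chunk: u64,
    nchunks: u64,
}

fn plan_reads(values: &[ValueRead]) -> Vec<PhysicalRead> {
    // Collect every chunk touched by any value. The set deduplicates chunks
    // shared by several values and iterates in ascending order.
    let mut chunks = BTreeSet::new();
    for v in values.iter().filter(|v| v.len > 0) {
        let first = v.pos / DIO_CHUNK_SIZE;
        let last = (v.pos + v.len - 1) / DIO_CHUNK_SIZE;
        chunks.extend(first..=last);
    }

    // Merge runs of adjacent chunks into larger reads, capped at MAX_READ_SIZE.
    let max_chunks = MAX_READ_SIZE / DIO_CHUNK_SIZE;
    let mut reads: Vec<PhysicalRead> = Vec::new();
    for chunk in chunks {
        if let Some(last) = reads.last_mut() {
            if last.start_chunk + last.nchunks == chunk && last.nchunks < max_chunks {
                last.nchunks += 1;
                continue;
            }
        }
        reads.push(PhysicalRead { start_chunk: chunk, nchunks: 1 });
    }
    reads
}
```

In this sketch, two values that both touch chunk 0 cause that chunk to be read once, and a value spanning chunks 0 and 1 merges with it into a single two-chunk read.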

The `get_values_reconstruct_data` changes seem like a bit of overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code, but it relies on
the invariant of Delta&Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.
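
To illustrate what bit-packing an index value can look like, here is a minimal sketch. The layout (1 bit `will_init`, 31 bits length, 32 bits position) and the name `PackedIndexValue` are assumptions made for this example; the actual `InMemoryLayerIndexValue` layout may differ.

```rust
/// Hypothetical packed layout, from most to least significant bit:
/// [will_init: 1 bit][len: 31 bits][pos: 32 bits]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedIndexValue(u64);

const POS_BITS: u32 = 32;
const LEN_BITS: u32 = 31;
const POS_MASK: u64 = (1 << POS_BITS) - 1;
const LEN_MASK: u64 = (1 << LEN_BITS) - 1;

impl PackedIndexValue {
    /// Returns `None` if `pos` or `len` does not fit into its bit field.
    fn new(pos: u64, len: u64, will_init: bool) -> Option<Self> {
        if pos > POS_MASK || len > LEN_MASK {
            return None;
        }
        let packed = ((will_init as u64) << (POS_BITS + LEN_BITS)) | (len << POS_BITS) | pos;
        Some(Self(packed))
    }

    fn pos(self) -> u64 {
        self.0 & POS_MASK
    }

    fn len(self) -> u64 {
        (self.0 >> POS_BITS) & LEN_MASK
    }

    fn will_init(self) -> bool {
        (self.0 >> (POS_BITS + LEN_BITS)) != 0
    }
}
```

Packing all three fields into one `u64` keeps each index entry small, and the range check in `new` guards against values being silently truncated.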

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see a definite improvement in the ingest performance microbenchmark
and in an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.

</details>

# Rollout

This PR changes the code in-place and is not gated by a feature flag.

2024-08-28 18:31:41 +00:00

.cargo

build: back to opt-level=0 in debug builds, for faster compile times (#5751 )

2023-11-20 15:41:37 +01:00

.config

remove workspace hack from libs (#8780 )

2024-08-21 14:45:32 +01:00

.github

pageserver: do vectored read on each dio-aligned section once (#8763 )

2024-08-28 15:54:42 +01:00

compute_tools

Wait for completion of the upload queue in flush_frozen_layer (#8550 )

2024-08-02 13:07:12 +02:00

control_plane

storcon: track pageserver availability zone (#8852 )

2024-08-28 18:23:55 +01:00

docker-compose

Fix the pg_hintplan flakyness (#8834 )

2024-08-27 12:39:42 +02:00

docs

docs: rolling storage controller restarts RFC (#8310 )

2024-08-28 13:56:14 +00:00

libs

storcon: track pageserver availability zone (#8852 )

2024-08-28 18:23:55 +01:00

pageserver

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537 )

2024-08-28 18:31:41 +00:00

patches

Fix the pg_hintplan flakyness (#8834 )

2024-08-27 12:39:42 +02:00

pgxn

Remove support for pageserver <-> compute protocol version 1 (#8774 )

2024-08-27 18:36:33 +03:00

proxy

proxy: Rename backend types and variants as prep for refactor (#8845 )

2024-08-27 14:12:42 +02:00

safekeeper

safekeeper: reorder routes and their handlers.

2024-08-27 07:37:55 +03:00

scripts

CI(autocomment): add arch to build type (#8809 )

2024-08-23 14:29:11 +01:00

storage_broker

storage broker: only print one line for version and build tag in init (#8624 )

2024-08-07 09:14:26 +02:00

storage_controller

storcon: track pageserver availability zone (#8852 )

2024-08-28 18:23:55 +01:00

storage_scrubber

fix(storage-scrubber): make retry error into warnings (#8851 )

2024-08-28 13:39:21 -04:00

test_runner

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537 )

2024-08-28 18:31:41 +00:00

vendor

Update Postgres 16 to 16.4

2024-08-22 18:03:45 -05:00

workspace_hack

remove workspace hack from libs (#8780 )

2024-08-21 14:45:32 +01:00

.dockerignore

Rename S3 scrubber to storage scrubber (#8013 )

2024-06-11 22:45:22 +00:00

.git-blame-ignore-revs

Add .git-blame-ignore-revs file (#2318 )

2022-08-22 16:38:31 +01:00

.gitattributes

devx: nicer diff hunk headers (#8482 )

2024-07-24 16:50:49 +01:00

.gitignore

Add new compaction abstraction, simulator, and implementation. (#6830 )

2024-02-27 17:15:46 +01:00

.gitmodules

Feat/postgres 16 (#4761 )

2023-09-12 15:11:32 +02:00

.neon_clippy_args

clippy-deny the todo!() macro (#4340 )

2024-06-25 18:03:27 +00:00

Cargo.lock

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537 )

2024-08-28 18:31:41 +00:00

Cargo.toml

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537 )

2024-08-28 18:31:41 +00:00

clippy.toml

tokio-epoll-uring: retry on launch failures due to locked memory (#7141 )

2024-03-15 19:46:15 +00:00

CODEOWNERS

CODEOWNERS: collapse safekeepers into storage (#8510 )

2024-07-26 13:29:59 +00:00

CONTRIBUTING.md

CI: use build-tools image from dockerhub (#6795 )

2024-02-28 12:38:11 +00:00

deny.toml

proxy: start of jwk cache (#8690 )

2024-08-14 13:35:29 +01:00

diesel.toml

Clean up 'attachment service' names to storage controller (#7326 )

2024-04-05 16:18:00 +01:00

Dockerfile

CI(neon-image): add ARM-specific RUSTFLAGS (#8566 )

2024-08-14 17:03:21 +01:00

Dockerfile.build-tools

Dockerfiles: remove cachepot (#8666 )

2024-08-09 15:48:16 +01:00

Dockerfile.compute-node

Fix the pg_hintplan flakyness (#8834 )

2024-08-27 12:39:42 +02:00

LICENSE

Add LICENSE and COPYRIGHT files.

2021-05-27 15:33:08 +03:00

Makefile

build: mark target/ and pg_install/ with CACHEDIR.TAG (#8448 )

2024-07-22 17:32:25 +02:00

NOTICE

Update copyright notice, set it to current year (#6671 )

2024-02-08 00:48:31 +01:00

poetry.lock

chore(deps): bump aiohttp from 3.9.4 to 3.10.2 (#8684 )

2024-08-11 12:21:32 +01:00

pre-commit.py

Check that TERM != dumb before using colors in pre-commit.py

2024-08-22 18:03:45 -05:00

pyproject.toml

chore(deps): bump aiohttp from 3.9.4 to 3.10.2 (#8684 )

2024-08-11 12:21:32 +01:00

pytest.ini

Improve pytest ergonomics

2022-10-04 14:53:01 +03:00

README.md

Require poetry >=1.8 (#8812 )

2024-08-23 11:48:26 +01:00

run_clippy.sh

build: run clippy for powerset of features (#4077 )

2023-04-27 15:01:27 +03:00

rust-toolchain.toml

CI(build-tools): update Rust, Python, Mold (#8667 )

2024-08-09 06:17:16 +00:00

vm-image-spec.yaml

fix(sql-exporter): Remove tenant_id from compute_logical_snapshot_files

2024-08-27 00:51:23 +02:00

README.md

Neon

Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Quick start

Try the Neon Free Tier to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online SQL Editor. See Connect from any application for connection instructions.

Alternatively, compile and run the project locally.

Architecture overview

A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.

The Neon storage engine consists of two major components:

Pageserver: Scalable storage backend for the compute nodes.
Safekeepers: The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.

See developer documentation in SUMMARY.md for more information.

Running local installation

Installing dependencies on Linux

Install build dependencies and other applicable packages

On Ubuntu or Debian, this set of packages should be sufficient to build the code:

apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev

On Fedora, these packages are needed:

dnf install flex bison readline-devel zlib-devel openssl-devel \
  libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
  protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
  libffi-devel

On Arch based systems, these packages are needed:

pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl lsof

Building Neon requires 3.15+ version of protoc (protobuf-compiler). If your distribution provides an older version, you can install a newer version from here.

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing dependencies on macOS (12.3.1)

Install XCode and dependencies

xcode-select --install
brew install protobuf openssl flex bison icu4c pkg-config

# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install PostgreSQL Client

# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq

Rustc version

The project uses a rust toolchain file to define the version it's built with in CI for testing and local builds.

This file is automatically picked up by rustup that installs (if absent) and uses the toolchain version pinned in the file.

rustup users who want to build with another toolchain can use the rustup override command to set a specific toolchain for the project's directory.

non-rustup users most probably are not getting the same toolchain automatically from the file, so are responsible to manually verify that their toolchain matches the version in the file. Newer rustc versions most probably will work fine, yet older ones might not be supported due to some new features used by the project or the crates.

Building on Linux

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log

make -j`nproc` -s

Building on OSX

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`sysctl -n hw.logicalcpu` -s"
# Remove -s for the verbose build log

make -j`sysctl -n hw.logicalcpu` -s

Dependency installation notes

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include pg_install/bin and pg_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.9 or higher), and install the python3 packages using ./scripts/pysync (requires poetry>=1.8) in the project directory.

Running neon database

Start pageserver and postgres on top of it (should be called from repo root):

# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> cargo neon init
Initializing pageserver node 1 at '127.0.0.1:64000' in ".neon"

# start pageserver, safekeeper, and broker for their intercommunication
> cargo neon start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2918372
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon".
pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437

# create initial tenant and use it as a default for every future neon_local invocation
> cargo neon tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one

# create postgres compute node
> cargo neon endpoint create main

# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'

# check list of running postgres instances
> cargo neon endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main      127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main         0/16B5BA8  running

Now, it is possible to connect to postgres and run some queries:

> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

And create branches and run postgres on them:

# create branch named migration_check
> cargo neon timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'

# check branches tree
> cargo neon timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]

# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check

# start postgres on that branch
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'

# check the new list of running postgres instances
> cargo neon endpoint list
 ENDPOINT         ADDRESS          TIMELINE                          BRANCH NAME      LSN        STATUS
 main             127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main             0/16F9A38  running
 migration_check  127.0.0.1:55434  b3b863fa45fa9e57e615f9f2d944e601  migration_check  0/16F9A70  running

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p 55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1

# check that the new change doesn't affect the 'main' postgres
> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

If you want to run tests afterwards (see below), you must stop all the running pageserver, safekeeper, and postgres instances you have just started. You can terminate them all with one command:

> cargo neon stop

More advanced usages can be found at Control Plane and Neon Local.

Handling build failures

If you encounter errors during setting up the initial tenant, it's best to stop everything (cargo neon stop) and remove the .neon directory. Then fix the problems, and start the setup again.

Running tests

Rust unit tests

We are using cargo-nextest to run the tests in GitHub Workflows. Some crates do not support running plain cargo test anymore, so prefer cargo nextest run instead. You can install cargo-nextest with cargo install cargo-nextest.

Integration tests

Ensure your dependencies are installed as described here.

git clone --recursive https://github.com/neondatabase/neon.git

CARGO_BUILD_FLAGS="--features=testing" make

./scripts/pytest

By default, this runs both debug and release modes, and all supported postgres versions. When testing locally, it is convenient to run just one set of permutations, like this:

DEFAULT_PG_VERSION=16 BUILD_TYPE=release ./scripts/pytest

Flamegraphs

You may find yourself in need of flamegraphs for software in this repository. You can use flamegraph-rs or the original flamegraph.pl. Your choice!

Important

If you're using lld or mold, you need the --no-rosegment linker argument. It's a general thing with Rust / lld / mold, not specific to this repository. See this PR for further instructions.

Cleanup

For cleaning up the source tree from build artifacts, run make clean in the source directory.

For removing every artifact from build and configure steps, run make distclean, and also consider removing the cargo binaries in the target directory, as well as the database in the .neon directory. Note that removing the .neon directory will remove your database, with all data in it. You have been warned!

Documentation

docs contains a top-level overview of all available markdown documentation.

sourcetree.md contains an overview of the source tree layout.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

See also README files in some source directories, and rustdoc style documentation comments.

Other resources:

SELECT 'Hello, World': Blog post by Nikita Shamgunov on the high level architecture
Architecture decisions in Neon: Blog post by Heikki Linnakangas
Neon: Serverless PostgreSQL!: Presentation on storage system by Heikki Linnakangas in the CMU Database Group seminar series

Postgres-specific terms

Due to Neon's very close relation with PostgreSQL internals, numerous specific terms are used. The same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Neon glossary
PostgreSQL glossary
Other PostgreSQL documentation and sources (Neon fork sources can be found here)

Join the development

Read CONTRIBUTING.md to learn about project code style and practices.
To get familiar with a source tree layout, use sourcetree.md.
To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html

Languages

Rust 73.5%

Python 19.4%

C 5.2%

Dockerfile 0.8%

Shell 0.3%

Other 0.8%