Christian Schwarz 7a63685cde simplify page-caching of EphemeralFile (#4994)
(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Further, many great open-source libraries exist that solve the problem of a
read-only cache
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random (see the sketch below). Why?
  * in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
  * in our system, though, the values of `InMemoryLayer` are stored in an
`EphemeralFile`, and hence they are not guaranteed to be memory-resident
  * the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order
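
To illustrate that last point, here is a hypothetical sketch (the names
`IndexEntry` and `read_order` are made up; the real `InMemoryLayer` index is
more involved): values land in the `EphemeralFile` in arrival order, but
`create_delta_layer` reads them back sorted by `(Key, LSN)`, so the file
offsets it visits are generally non-monotonic.

```rust
/// Illustrative only: why `create_delta_layer` reads are ~random in file
/// order. `IndexEntry` is a hypothetical stand-in for the in-memory index
/// that maps a (key, lsn) pair to a value's offset in the EphemeralFile.
struct IndexEntry {
    key: u64,
    lsn: u64,
    file_off: u64, // offset of the serialized Value, in append order
}

/// Returns the file offsets in the order `create_delta_layer` reads them.
fn read_order(mut index: Vec<IndexEntry>) -> Vec<u64> {
    // The C0 -> C1 flush reads values in (Key, LSN) order, not insert order...
    index.sort_by_key(|e| (e.key, e.lsn));
    // ...so the offsets below jump around the file more or less randomly.
    index.into_iter().map(|e| e.file_off).collect()
}
```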

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
  * the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
  * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE`
writes.
  * If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
  * In practice, a busy pageserver will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
  * The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
  * So, the best case is that subsequent values are located on the same
page; hence the `BlockCursor`'s buffer is maximally effective.
  * The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
    * The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * The worst case translates into `#Values` page cache accesses (a
worked example follows this list).
  * Again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but in `Key, LSN` order.
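
To make those bounds concrete (a back-of-the-envelope calculation, assuming
the Pageserver's 8KiB page size): the best case is `256MB / 8KiB = 32768`
page cache accesses per `create_delta_layer` run, one per page. Assuming,
for illustration, an average `Value` size of 128 bytes, the worst case is
`256MB / 128B ≈ 2 million` accesses, i.e., roughly 64x amplification.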

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
  * If `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
  * There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum` (a sketch of the scheme follows this list)
  * for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
    * Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
    * It is safe to page-cache these as `ImmutablePage` because
`EphemeralFile` is append-only.
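
Below is a minimal, self-contained sketch of this scheme. It is not the
Pageserver's actual code: the struct layout and signatures are illustrative
assumptions, and a plain `Vec<u8>` stands in for the page cache read guard
that the real `ImmutablePage` variant would hold; only the names
`mutable_tail`, `BlockLease`, and `ImmutablePage` come from this PR.

```rust
// Minimal sketch of the scheme described above; illustrative, not the
// actual Pageserver code.
use std::io::Write;
use std::os::unix::fs::FileExt;

const PAGE_SZ: usize = 8192; // assumed page size

enum BlockLease<'a> {
    /// Reference into the in-memory tail. Holding this borrows the file,
    /// so Rust's borrow rules forbid `write_blob` while it is outstanding.
    MutableTail(&'a [u8; PAGE_SZ]),
    /// An already-flushed block; immutable because the file is append-only,
    /// hence safe to serve from a read-only page cache.
    ImmutablePage(Vec<u8>),
}

struct EphemeralFile {
    file: std::fs::File,
    flushed_blocks: u32,         // number of full pages already on disk
    tail_off: usize,             // write position within `mutable_tail`
    mutable_tail: [u8; PAGE_SZ], // single fixed buffer: no double-buffering
}

impl EphemeralFile {
    fn write_blob(&mut self, mut buf: &[u8]) -> std::io::Result<()> {
        while !buf.is_empty() {
            let n = (PAGE_SZ - self.tail_off).min(buf.len());
            self.mutable_tail[self.tail_off..self.tail_off + n]
                .copy_from_slice(&buf[..n]);
            self.tail_off += n;
            buf = &buf[n..];
            if self.tail_off == PAGE_SZ {
                // Tail full: write it out (this is also where the real code
                // can pre-warm the page cache), then zero it and reuse it.
                self.file.write_all(&self.mutable_tail)?;
                self.flushed_blocks += 1;
                self.mutable_tail = [0u8; PAGE_SZ];
                self.tail_off = 0;
            }
        }
        Ok(())
    }

    /// The block-level read backing `read_blob`.
    fn read_blk(&self, blknum: u32) -> std::io::Result<BlockLease<'_>> {
        if blknum == self.flushed_blocks {
            // The block that corresponds to the mutable tail.
            Ok(BlockLease::MutableTail(&self.mutable_tail))
        } else {
            // A flushed, immutable block; a real implementation would
            // consult the page cache before hitting the file.
            let mut page = vec![0u8; PAGE_SZ];
            self.file
                .read_exact_at(&mut page, u64::from(blknum) * PAGE_SZ as u64)?;
            Ok(BlockLease::ImmutablePage(page))
        }
    }
}
```

Because the tail is flushed eagerly at page granularity and then reused,
each `EphemeralFile` owns exactly one page of buffer memory, and evicting
any of its page-cached blocks never requires write-back.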

## Performance 

How do the changes above affect performance?
My claim is: not significantly.

* **write path**:
  * Before this PR, `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
    * If there were enough free pages, it didn't issue *any* `write`
system calls.
    * If it had to evict other `EphemeralPage`s to get a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
  * With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
    * Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
  * The perf impact of always doing the writes is the CPU overhead and
syscall latency.
    * Before this PR, we might have never issued them if there were enough
free pages.
    * We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
  * There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
  * The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
  * So, the behavior is the same as with the `EphemeralPage`s before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
    * Going through the page cache likely causes read amplification here.
Why?
      * Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in a page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Value`s from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
      * bypass PS page cache + use `copy_file_range` to copy from the
ephemeral file into the L0 delta file, without going through user space
(see the sketch below)
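
For the `copy_file_range` idea draft, a hedged sketch of the kernel-side
copy, assuming Linux and the `libc` crate (the wrapper name `copy_range`
and its signature are made up for illustration):

```rust
use std::os::unix::io::AsRawFd;

/// Hypothetical helper: copy `len` bytes from `src` at `src_off` to the
/// end of `dst`, entirely inside the kernel. The data never enters the
/// Pageserver's address space, sidestepping the read amplification above.
fn copy_range(
    src: &std::fs::File,
    src_off: u64,
    dst: &std::fs::File,
    len: usize,
) -> std::io::Result<usize> {
    let mut off: libc::loff_t = src_off as libc::loff_t;
    // copy_file_range(2): with a null output offset, the kernel writes at
    // (and advances) dst's current file offset; flags must currently be 0.
    let n = unsafe {
        libc::copy_file_range(
            src.as_raw_fd(),
            &mut off,
            dst.as_raw_fd(),
            std::ptr::null_mut(),
            len,
            0,
        )
    };
    if n < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(n as usize)
    }
}
```

Note that a short copy is permitted by the syscall, so a real caller would
loop until all `len` bytes have been moved.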

Neon

Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Quick start

Try the Neon Free Tier to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online SQL Editor. See Connect from any application for connection instructions.

Alternatively, compile and run the project locally.

Architecture overview

A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.

The Neon storage engine consists of two major components:

  • Pageserver. Scalable storage backend for the compute nodes.
  • Safekeepers. The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.

See developer documentation in SUMMARY.md for more information.

Running local installation

Installing dependencies on Linux

  1. Install build dependencies and other applicable packages
  • On Ubuntu or Debian, this set of packages should be sufficient to build the code:
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python-poetry
  • On Fedora, these packages are needed:
dnf install flex bison readline-devel zlib-devel openssl-devel \
  libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
  protobuf-devel libcurl-devel openssl poetry
  • On Arch based systems, these packages are needed:
pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl

Building Neon requires protoc (protobuf-compiler) version 3.15 or newer. If your distribution provides an older version, you can install a newer version from here.

  2. Install Rust
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing dependencies on macOS (12.3.1)

  1. Install XCode and dependencies
xcode-select --install
brew install protobuf openssl flex bison

# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc
  2. Install Rust
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  3. Install PostgreSQL Client
# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq

Rustc version

The project uses a rust toolchain file to define the version it's built with in CI for testing and local builds.

This file is automatically picked up by rustup, which installs (if absent) and uses the toolchain version pinned in the file.

rustup users who want to build with another toolchain can use the rustup override command to set a specific toolchain for the project's directory.

Non-rustup users most probably will not get the same toolchain automatically from the file, so they are responsible for manually verifying that their toolchain matches the version in the file. Newer rustc versions will most probably work fine, yet older ones might not be supported due to new features used by the project or the crates.

Building on Linux

  1. Build neon and patched postgres
# Note: The path to the neon sources cannot contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build, which is
# demonstrably slower than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log

make -j`nproc` -s

Building on OSX

  1. Build neon and patched postgres
# Note: The path to the neon sources cannot contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build, which is
# demonstrably slower than a release build. For a release build,
# use "BUILD_TYPE=release make -j`sysctl -n hw.logicalcpu` -s"
# Remove -s for the verbose build log

make -j`sysctl -n hw.logicalcpu` -s

Dependency installation notes

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include pg_install/bin and pg_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.9 or higher), and install python3 packages using ./scripts/pysync (requires poetry>=1.3) in the project directory.

Running neon database

  1. Start pageserver and postgres on top of it (should be called from repo root):
# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> cargo neon init
Initializing pageserver node 1 at '127.0.0.1:64000' in ".neon"

# start pageserver, safekeeper, and broker for their intercommunication
> cargo neon start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2918372
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon".
pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437

# create initial tenant and use it as a default for every future neon_local invocation
> cargo neon tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one

# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'

# check list of running postgres instances
> cargo neon endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main      127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main         0/16B5BA8  running
  2. Now, it is possible to connect to postgres and run some queries:
> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)
  3. And create branches and run postgres on them:
# create branch named migration_check
> cargo neon timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'

# check branches tree
> cargo neon timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]

# start postgres on that branch
> cargo neon endpoint start migration_check --branch-name migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'

# check the new list of running postgres instances
> cargo neon endpoint list
 ENDPOINT         ADDRESS          TIMELINE                          BRANCH NAME      LSN        STATUS
 main             127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main             0/16F9A38  running
 migration_check  127.0.0.1:55434  b3b863fa45fa9e57e615f9f2d944e601  migration_check  0/16F9A70  running

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1

# check that the new change doesn't affect the 'main' postgres
> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)
  4. If you want to run tests afterward (see below), you must first stop all the pageserver, safekeeper, and postgres instances you have just started. You can terminate them all with one command:
> cargo neon stop

Running tests

Ensure your dependencies are installed as described here.

git clone --recursive https://github.com/neondatabase/neon.git

CARGO_BUILD_FLAGS="--features=testing" make

./scripts/pytest

By default, this runs both debug and release modes, and all supported postgres versions. When testing locally, it is convenient to run just one set of permutations, like this:

DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest

Documentation

The docs directory contains a top-level overview of all available markdown documentation.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

See also README files in some source directories, and rustdoc style documentation comments.

Other resources:

Postgres-specific terms

Due to Neon's very close relation with PostgreSQL internals, numerous specific terms are used. The same applies to certain spelling: e.g., we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it would be inconsistent with what the PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Join the development
