rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-06-04 22:10:39 +00:00

Christian Schwarz fc66ba43c4 Revert "revert recent VirtualFile asyncification changes (#5291)" (#6309)

This reverts commit ab1f37e908.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).
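
The artificial `Pending` can be modeled without tokio. The following is a minimal sketch using only the standard library, not tokio's actual implementation: `AcquireWithBudget`, `NoopWake`, and `count_yields` are hypothetical names, and the per-poll budget of 128 is an assumption made for illustration. The "resource" is always available, yet the future still yields whenever the budget runs out.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a sync primitive whose resource is always free.
struct AcquireWithBudget<'a> {
    budget: &'a Cell<u32>,
}

impl Future for AcquireWithBudget<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let left = self.budget.get();
        if left == 0 {
            // Budget exhausted: yield although nothing is actually contended.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        self.budget.set(left - 1);
        Poll::Ready(())
    }
}

/// Waker that does nothing; the loop below re-polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Drives one task that performs `acquisitions` acquisitions and returns how
/// often it yielded. The budget is refilled before every poll.
/// `budget_per_poll` must be greater than zero, otherwise this never finishes.
fn count_yields(acquisitions: u32, budget_per_poll: u32) -> u32 {
    let budget = Cell::new(0);
    let mut task = Box::pin(async {
        for _ in 0..acquisitions {
            AcquireWithBudget { budget: &budget }.await;
        }
    });
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut yields = 0;
    loop {
        budget.set(budget_per_poll);
        match task.as_mut().poll(&mut cx) {
            Poll::Ready(()) => return yields,
            Poll::Pending => yields += 1,
        }
    }
}

fn main() {
    // 300 uncontended acquisitions with a budget of 128 per poll.
    println!("yields: {}", count_yields(300, 128)); // prints "yields: 2"
}
```

In a real executor, each of those yields is an opportunity for the scheduler to run a different task.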

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
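
The thrashing pattern can be reproduced with a toy model. The sketch below assumes an LRU-evicting cache and strict round-robin scheduling; `FdCache` and `simulate` are hypothetical names and do not mirror the pageserver's actual VirtualFile code or its eviction policy. It counts cache hits and misses when each task performs a fixed number of file operations per turn.

```rust
use std::collections::VecDeque;

/// Toy model of a fixed-size file descriptor cache with LRU eviction.
struct FdCache {
    slots: usize,
    /// Front is least recently used, back is most recently used.
    open: VecDeque<usize>,
    hits: u64,
    misses: u64,
}

impl FdCache {
    fn new(slots: usize) -> Self {
        FdCache { slots, open: VecDeque::new(), hits: 0, misses: 0 }
    }

    /// One file operation by `task`: reuse its cached fd, or evict the
    /// least recently used entry and "reopen" the file.
    fn access(&mut self, task: usize) {
        if let Some(pos) = self.open.iter().position(|&t| t == task) {
            self.open.remove(pos);
            self.hits += 1;
        } else {
            if self.open.len() == self.slots {
                self.open.pop_front();
            }
            self.misses += 1;
        }
        self.open.push_back(task);
    }
}

/// Runs `tasks` tasks round-robin for `rounds` rounds. Each task performs
/// `ops_per_turn` operations before yielding. Returns (hits, misses).
fn simulate(tasks: usize, slots: usize, ops_per_turn: usize, rounds: usize) -> (u64, u64) {
    let mut cache = FdCache::new(slots);
    for _ in 0..rounds {
        for task in 0..tasks {
            for _ in 0..ops_per_turn {
                cache.access(task);
            }
        }
    }
    (cache.hits, cache.misses)
}

fn main() {
    // More tasks than slots, one operation per turn: every access misses.
    println!("{:?}", simulate(200, 100, 1, 10)); // prints "(0, 2000)"
    // Fewer tasks than slots: only the first round misses.
    println!("{:?}", simulate(50, 100, 1, 10)); // prints "(450, 50)"
    // More progress per turn amortizes the cost of each miss.
    println!("{:?}", simulate(200, 100, 8, 10)); // prints "(14000, 2000)"
}
```

In this model, once the number of runnable tasks exceeds the number of slots and each task does one operation per turn, the hit rate drops to zero, which matches the behavior described above.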

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - https://github.com/neondatabase/cloud/issues/8351
- changing code to ensure forward progress once cache slots have been
  acquired
  - https://github.com/neondatabase/neon/pull/5480
  - https://github.com/neondatabase/neon/pull/5482
  - tbd: https://github.com/neondatabase/neon/issues/6065
- reducing the number of runnable tokio tasks
  - https://github.com/neondatabase/neon/pull/5578
  - https://github.com/neondatabase/neon/pull/6000
- fixing bugs that caused unnecessary concurrency induced by connection
  handlers
  - https://github.com/neondatabase/neon/issues/5993
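
The idea of bounding the number of concurrently running calculations can be illustrated with a counting semaphore. This is a minimal sketch, not the code from the linked PRs: it uses OS threads and a `Mutex`/`Condvar` pair as a stand-in for tokio tasks and an async semaphore, and `Limiter` and `run_limited` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from a mutex and a condition variable.
struct Limiter {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Limiter {
    fn new(permits: usize) -> Self {
        Limiter { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `jobs` jobs with at most `limit` running at once and returns the
/// peak number of jobs that were observed running concurrently.
fn run_limited(jobs: usize, limit: usize) -> usize {
    let limiter = Arc::new(Limiter::new(limit));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (limiter, running, peak) = (limiter.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                limiter.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::yield_now(); // stand-in for the actual calculation
                running.fetch_sub(1, Ordering::SeqCst);
                limiter.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let peak = run_limited(64, 4);
    println!("peak concurrency: {peak}");
    assert!(peak <= 4);
}
```

With such a limit in place, the number of jobs competing for cache slots at any moment is bounded by `limit` rather than by the number of timelines.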

I manually verified that this PR doesn't negatively affect startup
performance as follows:
I created a pageserver in production configuration, with 20k
tenants/timelines and 9 tiny L0 layer files each, started it, and observed

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We will still be exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).

2024-01-11 11:29:14 +01:00

.cargo

build: back to opt-level=0 in debug builds, for faster compile times (#5751 )

2023-11-20 15:41:37 +01:00

.config

Use nextest for rust unittests (#6223 )

2023-12-30 13:45:31 +00:00

.github

test_runner: replace black with ruff format (#6268 )

2024-01-05 15:35:07 +00:00

compute_tools

Collapse multiline queries in compute_ctl (#6316 )

2024-01-10 22:25:28 +04:00

control_plane

pageserver: implement secondary-mode downloads (#6123 )

2024-01-05 12:29:20 +00:00

docker-compose

Logical replication (#5271 )

2023-10-18 16:42:22 +03:00

docs

RFC: vectored Timeline::get (#6250 )

2024-01-08 15:00:01 +00:00

libs

pagebench: fixup after is_rel_block_key changes in #6266 (#6303 )

2024-01-09 19:00:37 +01:00

pageserver

Revert "revert recent VirtualFile asyncification changes (#5291)" (#6309)

2024-01-11 11:29:14 +01:00

pgxn

Fix minimum backoff to 1ms

2024-01-03 21:09:19 -08:00

proxy

Added auth info cache with notifiations to redis. (#6208 )

2024-01-10 11:51:05 +00:00

s3_scrubber

s3_scrubber: updates for sharding (#6281 )

2024-01-08 09:19:10 +00:00

safekeeper

Add API for safekeeper timeline copy (#6091 )

2024-01-04 17:40:38 +00:00

scripts

test_runner: replace black with ruff format (#6268 )

2024-01-05 15:35:07 +00:00

storage_broker

Support custom types in broker (#5761 )

2023-12-19 17:06:43 +00:00

test_runner

pageserver: cleanup redundant create/attach code, fix detach while attaching (#6277 )

2024-01-09 10:37:54 +00:00

trace

Update most of the dependencies to their latest versions (#4026 )

2023-04-14 18:28:54 +03:00

vendor

Bump postgres submodule versions

2023-12-27 08:39:00 -08:00

workspace_hack

proxy: add request context for observability and blocking (#6160 )

2024-01-08 11:42:43 +00:00

.dockerignore

Feat/postgres 16 (#4761 )

2023-09-12 15:11:32 +02:00

.git-blame-ignore-revs

Add .git-blame-ignore-revs file (#2318 )

2022-08-22 16:38:31 +01:00

.gitignore

Build dockerfile from neon repo (#6195 )

2023-12-21 12:46:51 +00:00

.gitmodules

Feat/postgres 16 (#4761 )

2023-09-12 15:11:32 +02:00

.neon_clippy_args

build: run clippy for powerset of features (#4077 )

2023-04-27 15:01:27 +03:00

Cargo.lock

Added auth info cache with notifiations to redis. (#6208 )

2024-01-10 11:51:05 +00:00

Cargo.toml

Added auth info cache with notifiations to redis. (#6208 )

2024-01-10 11:51:05 +00:00

clippy.toml

Disallow block_in_place and Handle::block_on (#5101 )

2023-09-12 00:11:16 +00:00

CODEOWNERS

Update CODEOWNERS (#5421 )

2023-09-28 17:34:51 +01:00

CONTRIBUTING.md

Build dockerfile from neon repo (#6195 )

2023-12-21 12:46:51 +00:00

deny.toml

Manage pgbouncer configuration from compute_ctl:

2023-12-26 15:17:09 +00:00

Dockerfile

Build dockerfile from neon repo (#6195 )

2023-12-21 12:46:51 +00:00

Dockerfile.buildtools

Update Rust to 1.75.0 (#6285 )

2024-01-08 11:46:16 +01:00

Dockerfile.compute-node

Pg stat statements reset for neon superuser (#6232 )

2023-12-27 18:15:17 +01:00

Dockerfile.compute-tools

Build dockerfile from neon repo (#6195 )

2023-12-21 12:46:51 +00:00

LICENSE

Add LICENSE and COPYRIGHT files.

2021-05-27 15:33:08 +03:00

Makefile

Make targets to run pgindent on core and neon extension.

2023-12-08 14:03:13 +04:00

NOTICE

Remove specific file references in NOTICE

2023-10-18 14:58:48 -05:00

poetry.lock

test_runner: replace black with ruff format (#6268 )

2024-01-05 15:35:07 +00:00

pre-commit.py

test_runner: replace black with ruff format (#6268 )

2024-01-05 15:35:07 +00:00

pyproject.toml

test_runner: replace black with ruff format (#6268 )

2024-01-05 15:35:07 +00:00

pytest.ini

Improve pytest ergonomics

2022-10-04 14:53:01 +03:00

README.md

tests: update python dependencies (#6164 )

2023-12-18 15:47:09 +00:00

run_clippy.sh

build: run clippy for powerset of features (#4077 )

2023-04-27 15:01:27 +03:00

rust-toolchain.toml

Update Rust to 1.75.0 (#6285 )

2024-01-08 11:46:16 +01:00

vm-image-spec.yaml

vm-image-spec: build pgbouncer from Neon's fork (#6249 )

2024-01-03 13:02:04 +00:00

README.md

Neon

Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Quick start

Try the Neon Free Tier to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online SQL Editor. See Connect from any application for connection instructions.

Alternatively, compile and run the project locally.

Architecture overview

A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.

The Neon storage engine consists of two major components:

Pageserver. Scalable storage backend for the compute nodes.
Safekeepers. The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.

See developer documentation in SUMMARY.md for more information.

Running local installation

Installing dependencies on Linux

Install build dependencies and other applicable packages

On Ubuntu or Debian, this set of packages should be sufficient to build the code:

apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev

On Fedora, these packages are needed:

dnf install flex bison readline-devel zlib-devel openssl-devel \
  libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
  protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
  libffi-devel

On Arch based systems, these packages are needed:

pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl lsof

Building Neon requires protoc (protobuf-compiler) version 3.15 or newer. If your distribution provides an older version, you can install a newer version from here.

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing dependencies on macOS (12.3.1)

Install XCode and dependencies

xcode-select --install
brew install protobuf openssl flex bison icu4c pkg-config

# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install PostgreSQL Client

# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq

Rustc version

The project uses a rust toolchain file to pin the version it is built with in CI, for testing and local builds.

rustup picks this file up automatically: it installs the pinned toolchain version if it is absent and then uses it.

rustup users who want to build with another toolchain can use the rustup override command to set a specific toolchain for the project's directory.

Non-rustup users most likely do not get the same toolchain automatically from the file, so they are responsible for manually verifying that their toolchain matches the version in the file. Newer rustc versions will most likely work fine, but older ones might not be supported because the project or its crates use newer features.

Building on Linux

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log

make -j`nproc` -s

Building on OSX

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`sysctl -n hw.logicalcpu` -s"
# Remove -s for the verbose build log

make -j`sysctl -n hw.logicalcpu` -s

Dependency installation notes

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include pg_install/bin and pg_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.9 or higher), and install python3 packages using ./scripts/pysync (requires poetry>=1.3) in the project directory.

Running neon database

Start pageserver and postgres on top of it (should be called from repo root):

# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> cargo neon init
Initializing pageserver node 1 at '127.0.0.1:64000' in ".neon"

# start pageserver, safekeeper, and broker for their intercommunication
> cargo neon start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2918372
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon".
pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437

# create initial tenant and use it as a default for every future neon_local invocation
> cargo neon tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one

# create postgres compute node
> cargo neon endpoint create main

# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'

# check list of running postgres instances
> cargo neon endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main      127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main         0/16B5BA8  running

Now, it is possible to connect to postgres and run some queries:

> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

And create branches and run postgres on them:

# create branch named migration_check
> cargo neon timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'

# check branches tree
> cargo neon timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]

# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check

# start postgres on that branch
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'

# check the new list of running postgres instances
> cargo neon endpoint list
 ENDPOINT         ADDRESS          TIMELINE                          BRANCH NAME      LSN        STATUS
 main             127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main             0/16F9A38  running
 migration_check  127.0.0.1:55434  b3b863fa45fa9e57e615f9f2d944e601  migration_check  0/16F9A70  running

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1

# check that the new change doesn't affect the 'main' postgres
> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

If you want to run tests afterward (see below), you must stop all the running pageserver, safekeeper, and postgres instances you have just started. You can terminate them all with one command:

> cargo neon stop

Running tests

Ensure your dependencies are installed as described here.

git clone --recursive https://github.com/neondatabase/neon.git

CARGO_BUILD_FLAGS="--features=testing" make

./scripts/pytest

By default, this runs both debug and release modes, and all supported postgres versions. When testing locally, it is convenient to run just one set of permutations, like this:

DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest

Documentation

docs contains a top-level overview of all available markdown documentation.

sourcetree.md contains an overview of the source tree layout.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

See also README files in some source directories, and rustdoc style documentation comments.

Other resources:

SELECT 'Hello, World': Blog post by Nikita Shamgunov on the high level architecture
Architecture decisions in Neon: Blog post by Heikki Linnakangas
Neon: Serverless PostgreSQL!: Presentation on storage system by Heikki Linnakangas in the CMU Database Group seminar series

Postgres-specific terms

Because Neon is closely tied to PostgreSQL internals, numerous PostgreSQL-specific terms are used. The same applies to certain spellings: for example, we use MB to denote 1024 * 1024 bytes. MiB would be technically more correct, but it is inconsistent with what the PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Neon glossary
PostgreSQL glossary
Other PostgreSQL documentation and sources (Neon fork sources can be found here)

Join the development

Read CONTRIBUTING.md to learn about project code style and practices.
To get familiar with the source tree layout, use sourcetree.md.
To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html

Languages

Rust 73.5%

Python 19.4%

C 5.2%

Dockerfile 0.8%

Shell 0.3%

Other 0.8%