rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-05 12:32:54 +00:00


Christian Schwarz 158db414bf buffered writer: handle write errors by retrying all write IO errors indefinitely (#10993 )

# Problem

If the Pageserver ingest path
(InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable
buffer
of the BufferedWriter, the buffered writer is left in a state where
subsequent _reads_
from the InMemoryLayer cause a `must not use after we returned
an error` panic.

The reason is that
1. the flush background task bails on flush failure, 
2. causing the `FlushHandle::flush` function to fail at channel.recv()
and
3. causing the `FlushHandle::flush` function to bail with the flush
error,
4. leaving its caller `BufferedWriter::flush` with
`BufferedWriter::mutable = None`,
5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent
reads can enter,
6. those reads find `mutable = None` and cause the panic.
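
The sequence above can be reproduced in miniature. The following is a minimal sketch, not the pageserver code: `Writer`, `flush`, and `read_len` are hypothetical stand-ins, and the only thing carried over from the description is the pattern of taking the buffer out of an `Option` and returning early on error before it is put back.

```rust
use std::io;

/// Hypothetical, simplified stand-in for the buffered writer.
struct Writer {
    /// The mutable buffer; meant to be `None` only while a flush is in flight.
    mutable: Option<Vec<u8>>,
}

impl Writer {
    fn new() -> Self {
        Writer { mutable: Some(Vec::new()) }
    }

    /// Takes the buffer, hands it to `flush_io`, and puts it back on success only.
    fn flush(&mut self, flush_io: impl FnOnce(&[u8]) -> io::Result<()>) -> io::Result<()> {
        let mut buf = self
            .mutable
            .take()
            .expect("must not use after we returned an error");
        // On error, `?` returns early and `self.mutable` stays `None`.
        flush_io(buf.as_slice())?;
        buf.clear();
        self.mutable = Some(buf);
        Ok(())
    }

    /// Reads inspect the mutable buffer and panic if it is gone.
    fn read_len(&self) -> usize {
        self.mutable
            .as_ref()
            .expect("must not use after we returned an error")
            .len()
    }
}

fn main() {
    let mut w = Writer::new();
    w.mutable.as_mut().unwrap().extend_from_slice(b"wal");
    assert_eq!(w.read_len(), 3);

    let res = w.flush(|_| Err(io::Error::new(io::ErrorKind::Other, "simulated ENOSPC")));
    assert!(res.is_err());
    assert!(w.mutable.is_none());
    // A subsequent `w.read_len()` would panic at this point.
}
```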

# Context

It has always been the contract that writes against the BufferedWriter
API
must not be retried because the writer/stream-style/append-only
interface makes no atomicity guarantees ("On error, did nothing or a
piece of the buffer get appended?").

The idea was that the error would bubble up to upper layers that can
throw away the buffered writer and create a new one. (See our [internal
error handling policy document on how to handle e.g.
`ENOSPC`](c870a50bc0/src/storage/handling_io_and_logical_errors.md (L36-L43))).

That _might_ be true for delta/image layer writers, I haven't checked.

But it's certainly not true for the ingest path: there are no provisions
to throw away an InMemoryLayer that encountered a write error and
reingest the WAL already written to it.

Adding such higher-level retries would involve resetting
last_record_lsn to a lower value and restarting walreceiver. The code
isn't flexible enough to do that, and such complexity likely isn't worth
it given that write errors are rare.

# Solution

The solution in this PR is to retry _any_ failing write operation
_indefinitely_ inside the buffered writer flush task, except of course
those that are fatal as per `maybe_fatal_err`.

Retrying indefinitely ensures that `BufferedWriter::mutable` is never
left `None` in the case of IO errors, thereby solving the problem
described above.
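
As a rough illustration of the retry policy, here is a minimal sketch assuming a synchronous write operation and a caller-supplied fatality check. `retry_indefinitely`, its `is_fatal` parameter, and the fixed backoff are hypothetical names and choices for this example; the real classification lives in `maybe_fatal_err`, whose behavior is not reproduced here.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Retries `op` until it succeeds or `is_fatal` classifies the error as fatal.
/// Returns the value together with the number of attempts that were made.
fn retry_indefinitely<T>(
    mut op: impl FnMut() -> io::Result<T>,
    is_fatal: impl Fn(&io::Error) -> bool,
    backoff: Duration,
) -> io::Result<(T, u64)> {
    let mut attempts: u64 = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return Ok((value, attempts)),
            // Fatal errors are not retried; this sketch simply hands them back.
            Err(err) if is_fatal(&err) => return Err(err),
            // Anything else (e.g. ENOSPC) is assumed to be temporary.
            Err(_) => thread::sleep(backoff),
        }
    }
}

fn main() {
    let mut failures_left = 3;
    let result = retry_indefinitely(
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err(io::Error::new(io::ErrorKind::Other, "simulated ENOSPC"))
            } else {
                Ok("flushed")
            }
        },
        |_| false, // assumption for the demo: nothing is fatal
        Duration::from_millis(1),
    );
    let (value, attempts) = result.unwrap();
    assert_eq!(value, "flushed");
    assert_eq!(attempts, 4);
}
```

Because the loop only exits on success or on a fatal error, the caller never observes a non-fatal failure, which is what keeps the buffer from being left behind as `None`.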

It's a clear improvement over the status quo.

However, while we're retrying, we build up backpressure because the
`flush` is only double-buffered, not infinitely buffered.

Backpressure here is generally good to avoid resource exhaustion, **but
blocks reads** and hence stalls GetPage requests because InMemoryLayer
reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
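
The backpressure effect can be shown with a loose analogy: a bounded channel of capacity 1 between a writer and a flush task that has stopped receiving because it is busy retrying. This is only a sketch using the standard library's `sync_channel`; `try_submit` is a hypothetical helper, and the real flush path is structured differently.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Tries to hand a full buffer to the flush task without blocking.
/// Returns the buffer to the caller if it could not be submitted.
fn try_submit(tx: &SyncSender<Vec<u8>>, buf: Vec<u8>) -> Result<(), Vec<u8>> {
    match tx.try_send(buf) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(buf)) | Err(TrySendError::Disconnected(buf)) => Err(buf),
    }
}

fn main() {
    // Capacity 1: one buffer can be queued while the writer fills the next one.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);

    // The flush task is stalled (not receiving), so the first buffer just sits there.
    assert!(try_submit(&tx, vec![1]).is_ok());
    // The next submission finds no room; a blocking `send` would wait here.
    assert_eq!(try_submit(&tx, vec![2]), Err(vec![2]));

    // Once the flush task makes progress, the writer can continue.
    assert_eq!(rx.recv().unwrap(), vec![1]);
    assert!(try_submit(&tx, vec![2]).is_ok());
}
```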

## Caveats

Note that there are some remaining conditions in the flush background
task where it can bail with an error. I have annotated one of them with
a TODO comment.
Hence the `FlushHandle::flush` is still fallible and hence the overall
scenario of leaving `mutable = None` on the bail path is still possible.
We can clean that up in a later commit.

Note also that retrying indefinitely is great for temporary errors like
ENOSPC but likely undesirable in case the `std::io::Error` we get is
really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant
directory got deleted and VirtualFile's reopen fails with ENOENT.

Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered
writer background task was to recv from flush op channel until the
sending side (FlushHandle) is explicitly shut down or dropped.
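
To make that cancellation story concrete, here is a minimal sketch using a standard-library channel and a thread in place of the real flush task. `flush_task` is a hypothetical name; the point is only that the loop ends when the sending side is dropped, and that this signal is checked nowhere except at `recv`.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Processes flush operations until every sender has been dropped.
/// Returns how many operations were handled.
fn flush_task(rx: Receiver<Vec<u8>>) -> usize {
    let mut flushed = 0;
    // `recv` returns `Err` only once all senders are gone.
    while let Ok(_buf) = rx.recv() {
        // A retry loop placed here would not return to `recv` until the
        // write succeeds, so it would not notice that the sender was dropped.
        flushed += 1;
    }
    flushed
}

fn main() {
    let (tx, rx) = channel();
    let task = thread::spawn(move || flush_task(rx));

    tx.send(vec![1u8]).unwrap();
    tx.send(vec![2u8]).unwrap();

    // Dropping the sender is the shutdown signal in this sketch.
    drop(tx);
    assert_eq!(task.join().unwrap(), 2);
}
```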

Failing to handle cancellation carries the operational risk that even if
a single timeline gets stuck because of a logic bug such as the one laid
out above, we must still restart the whole pageserver process.

# Alternatives Considered

As pointed out in the `Context` section, throwing away an InMemoryLayer
that encountered an error and reingesting the WAL is a lot of complexity
that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.

A more general and simpler solution for ENOSPC is to `abort()` the
process and run eviction on startup before bringing up the rest of
pageserver.
I argued for it in the past; the pro arguments are still valid and
complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed
and all tenants have secondaries.
For example, if pageserver `abort()`s on ENOSPC and then simply doesn't
respond to storcon heartbeats while it's running eviction on startup,
storcon will fail tenants over to the secondary anyway, giving us all
the time we need to clean up.

The downside is that if there's a systemic space management bug, the above
proposal will just propagate the problem to other nodes. But I imagine
that because of the delays involved with filling up disks, the system
might reach a half-stable state, providing operators more time to react.

# Demo

Intermediary commit `a03f335121480afc0171b0f34606bdf929e962c5` is demoed
in this (internal) screen recording:
https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link

# Perf Testing

Ran `bench_ingest` on tmpfs, no measurable difference.

Spans are uniquely owned by the flush task, and the span stack isn't too
deep, so, enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so, not _that_
high frequency event anyways.

# Refs
- fixes https://github.com/neondatabase/neon/issues/10856

2025-03-11 20:40:23 +00:00

README.md

Neon

Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Quick start

Try the Neon Free Tier to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online SQL Editor. See Connect from any application for connection instructions.

Alternatively, compile and run the project locally.

Architecture overview

A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.

The Neon storage engine consists of two major components:

Pageserver: Scalable storage backend for the compute nodes.
Safekeepers: The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.

See developer documentation in SUMMARY.md for more information.

Running a local development environment

Neon can be run on a workstation for small experiments and to test code changes, by following these instructions.

Installing dependencies on Linux

Install build dependencies and other applicable packages

On Ubuntu or Debian, this set of packages should be sufficient to build the code:

apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libprotobuf-dev libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev

On Fedora, these packages are needed:

dnf install flex bison readline-devel zlib-devel openssl-devel \
  libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
  protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
  libffi-devel

On Arch based systems, these packages are needed:

pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl lsof

Building Neon requires protoc (protobuf-compiler) version 3.15 or later. If your distribution provides an older version, you can install a newer version from here.

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing dependencies on macOS (12.3.1)

Install XCode and dependencies

xcode-select --install
brew install protobuf openssl flex bison icu4c pkg-config m4

# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc

If you get errors about missing m4 you may have to install it manually:

brew install m4
brew link --force m4

Install Rust

# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Install PostgreSQL Client

# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq

Rustc version

The project uses a rust toolchain file to define the version it's built with in CI for testing and local builds.

This file is automatically picked up by rustup that installs (if absent) and uses the toolchain version pinned in the file.

rustup users who want to build with another toolchain can use the rustup override command to set a specific toolchain for the project's directory.

Non-rustup users most likely do not get the same toolchain automatically from the file, so they are responsible for manually verifying that their toolchain matches the version in the file. Newer rustc versions will most likely work fine, but older ones might not be supported because of newer features used by the project or its crates.

Building on Linux

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log

make -j`nproc` -s

Building on OSX

Build neon and patched postgres

# Note: The path to the neon sources can not contain a space.

git clone --recursive https://github.com/neondatabase/neon.git
cd neon

# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`sysctl -n hw.logicalcpu` -s"
# Remove -s for the verbose build log

make -j`sysctl -n hw.logicalcpu` -s

Dependency installation notes

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include pg_install/bin and pg_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.11 or higher), and install the python3 packages using ./scripts/pysync (requires poetry>=1.8) in the project directory.

Running neon database

Start pageserver and postgres on top of it (should be called from repo root):

# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> cargo neon init
Initializing pageserver node 1 at '127.0.0.1:64000' in ".neon"

# start pageserver, safekeeper, and broker for their intercommunication
> cargo neon start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2918372
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon".
pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437

# create initial tenant and use it as a default for every future neon_local invocation
> cargo neon tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one

# create postgres compute node
> cargo neon endpoint create main

# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'

# check list of running postgres instances
> cargo neon endpoint list
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main      127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main         0/16B5BA8  running

Now, it is possible to connect to postgres and run some queries:

> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

And create branches and run postgres on them:

# create branch named migration_check
> cargo neon timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'

# check branches tree
> cargo neon timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]

# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check

# start postgres on that branch
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'

# check the new list of running postgres instances
> cargo neon endpoint list
 ENDPOINT         ADDRESS          TIMELINE                          BRANCH NAME      LSN        STATUS
 main             127.0.0.1:55432  de200bd42b49cc1814412c7e592dd6e9  main             0/16F9A38  running
 migration_check  127.0.0.1:55434  b3b863fa45fa9e57e615f9f2d944e601  migration_check  0/16F9A70  running

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p 55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1

# check that the new change doesn't affect the 'main' postgres
> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

If you want to run tests afterwards (see below), you must stop all the running pageserver, safekeeper, and postgres instances you have just started. You can terminate them all with one command:

> cargo neon stop

More advanced usages can be found at Local Development Control Plane (neon_local).

Handling build failures

If you encounter errors during setting up the initial tenant, it's best to stop everything (cargo neon stop) and remove the .neon directory. Then fix the problems, and start the setup again.

Running tests

Rust unit tests

We are using cargo-nextest to run the tests in GitHub Workflows. Some crates no longer support running plain cargo test, so prefer cargo nextest run instead. You can install cargo-nextest with cargo install cargo-nextest.

Integration tests

Ensure your dependencies are installed as described here.

git clone --recursive https://github.com/neondatabase/neon.git

CARGO_BUILD_FLAGS="--features=testing" make

./scripts/pytest

By default, this runs both debug and release modes, and all supported postgres versions. When testing locally, it is convenient to run just one set of permutations, like this:

DEFAULT_PG_VERSION=16 BUILD_TYPE=release ./scripts/pytest

Flamegraphs

You may find yourself in need of flamegraphs for software in this repository. You can use flamegraph-rs or the original flamegraph.pl. Your choice!

Important

If you're using lld or mold, you need the --no-rosegment linker argument. It's a general thing with Rust / lld / mold, not specific to this repository. See this PR for further instructions.

Cleanup

For cleaning up the source tree from build artifacts, run make clean in the source directory.

For removing every artifact from build and configure steps, run make distclean, and also consider removing the cargo binaries in the target directory, as well as the database in the .neon directory. Note that removing the .neon directory will remove your database, with all data in it. You have been warned!

Documentation

docs contains a top-level overview of all available markdown documentation.

sourcetree.md contains an overview of the source tree layout.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

See also README files in some source directories, and rustdoc style documentation comments.

Other resources:

SELECT 'Hello, World': Blog post by Nikita Shamgunov on the high level architecture
Architecture decisions in Neon: Blog post by Heikki Linnakangas
Neon: Serverless PostgreSQL!: Presentation on storage system by Heikki Linnakangas in the CMU Database Group seminar series

Postgres-specific terms

Due to Neon's very close relation with PostgreSQL internals, numerous PostgreSQL-specific terms are used. The same applies to certain spellings: for example, we use MB to denote 1024 * 1024 bytes. MiB would be technically more correct, but it is inconsistent with what the PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Neon glossary
PostgreSQL glossary
Other PostgreSQL documentation and sources (Neon fork sources can be found here)

Join the development

Read CONTRIBUTING.md to learn about project code style and practices.
To get familiar with the source tree layout, use sourcetree.md.
To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html
