Heikki Linnakangas 07342f7519 Major storage format rewrite.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what their sizes are, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository now consists
of simple key-value PUT/GET operations.

It's still not just any key-value store, though: a Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.
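
To make that concrete, here is a minimal sketch of such a versioned
key-value interface, with a toy in-memory implementation. The names
and signatures are illustrative, not the actual Rust trait in the
pageserver:

use std::collections::BTreeMap;

type Lsn = u64;
type Key = u128;
type Value = Vec<u8>;

// Every GET is versioned: it asks for the latest value of `key`
// at or before `lsn`.
trait Repository {
    fn put(&mut self, key: Key, lsn: Lsn, value: Value);
    fn get(&self, key: Key, lsn: Lsn) -> Option<&Value>;
}

// Toy in-memory implementation, just to pin down the semantics.
struct MemRepo {
    // (key, lsn) -> value; the BTreeMap ordering lets us find the
    // newest version at or below the requested LSN.
    versions: BTreeMap<(Key, Lsn), Value>,
}

impl Repository for MemRepo {
    fn put(&mut self, key: Key, lsn: Lsn, value: Value) {
        self.versions.insert((key, lsn), value);
    }

    fn get(&self, key: Key, lsn: Lsn) -> Option<&Value> {
        self.versions
            .range((key, Lsn::MIN)..=(key, lsn))
            .next_back()
            .map(|(_, v)| v)
    }
}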

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles the mapping from PostgreSQL
objects, like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
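
For illustration, a Key along those lines might look like the sketch
below; the field names are hypothetical, not the exact ones in the
source:

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Key {
    kind: u8,     // distinguishes relation blocks from metadata keys
    spcnode: u32, // tablespace OID  -+
    dbnode: u32,  // database OID     +- together: a RelFileNode
    relnode: u32, // relation OID    -+
    forknum: u8,  // main fork, FSM, visibility map, ...
    blknum: u32,  // block number within the fork
}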

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also how space used by deleted keys is reclaimed.
The Repository implementation doesn't have any explicit support for
deleting keys. Instead, deleted keys are simply omitted from the
partitioning, so when a new image layer is created, they are not
copied over to it. We might want to implement tombstone keys in the
future, to reclaim space faster, but this will work for now.
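
The following toy function illustrates both ideas, under the
simplifying assumption that the live keys are available as one sorted
list (the real code works on key ranges, not materialized lists):

// Split a sorted list of live keys into partitions of roughly
// `target_size` keys each. Deleted keys are simply absent from
// `keys`, so they fall out of the keyspace without explicit
// tombstones.
fn partition<K: Copy>(keys: &[K], target_size: usize) -> Vec<Vec<K>> {
    keys.chunks(target_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}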

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.
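
In other words, what a layer covers is now just a rectangle in
key/LSN space, along the lines of this hypothetical descriptor:

use std::ops::Range;

type Key = u128;
type Lsn = u64;

// Hypothetical layer descriptor: the key range and LSN range the
// file covers. For an image layer, lsn_range collapses to a single
// LSN; a delta layer covers a range of LSNs.
struct LayerDesc {
    key_range: Range<Key>,
    lsn_range: Range<Lsn>,
}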

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
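
A rough sketch of that handoff, with hypothetical names and
structure: freezing just flips a flag under a lock, and all the I/O
happens in the spawned flushing thread:

use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical in-memory layer; "freezing" it makes it immutable.
struct InMemoryLayer {
    frozen: bool,
    // ... buffered WAL records would live here ...
}

struct Timeline {
    open_layer: Arc<Mutex<InMemoryLayer>>,
}

impl Timeline {
    // Called by the WAL receiver when checkpoint_distance is reached:
    // freeze the current layer (cheap, no I/O) and hand it to a
    // background thread that writes it out as a new L0 delta layer.
    fn freeze_and_flush(&mut self) {
        let frozen = {
            let mut layer = self.open_layer.lock().unwrap();
            layer.frozen = true;
            Arc::clone(&self.open_layer)
        };
        // Open a fresh in-memory layer; WAL ingestion continues
        // immediately.
        self.open_layer = Arc::new(Mutex::new(InMemoryLayer { frozen: false }));

        // The actual I/O happens in the background flushing thread.
        thread::spawn(move || write_l0_delta_layer(&frozen));
    }
}

fn write_l0_delta_layer(_layer: &Mutex<InMemoryLayer>) {
    // ... serialize the frozen layer into an L0 delta layer file ...
}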

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
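
Conceptually, compaction merges the entries of several L0 layers,
each covering the whole key range over a narrow LSN range, and
re-splits them by key. A toy version, with illustrative types:

type Key = u128;
type Lsn = u64;
type Value = Vec<u8>;

// Merge several L0 delta layers and re-split the entries by key, so
// that each output layer covers a narrow key range instead of a
// narrow LSN range.
fn compact(
    l0_layers: Vec<Vec<(Key, Lsn, Value)>>,
    keys_per_layer: usize,
) -> Vec<Vec<(Key, Lsn, Value)>> {
    let mut entries: Vec<_> = l0_layers.into_iter().flatten().collect();
    entries.sort(); // order by key first, then LSN
    entries
        .chunks(keys_per_layer.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}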

Deployment
----------

This also contains changes to the Ansible scripts that make it
possible to run multiple different pageservers at the same time in
the staging environment. We will use that to keep an old version of
the pageserver running, for clusters created with the old version,
alongside a pageserver running the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>

Zenith

Zenith is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute, and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.

Architecture overview

A Zenith installation consists of compute nodes and the Zenith storage engine.

Compute nodes are stateless PostgreSQL nodes, backed by the Zenith storage engine.

The Zenith storage engine consists of two major components:

  • Pageserver. Scalable storage backend for the compute nodes.
  • WAL service. The service that receives WAL from the compute nodes and ensures that it is stored durably.

The pageserver consists of:

  • Repository - the Zenith storage implementation.
  • WAL receiver - service that receives WAL from the WAL service and stores it in the repository.
  • Page service - service that communicates with compute nodes and responds with pages from the repository.
  • WAL redo - service that builds pages from base images and WAL records, on request from the page service.
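
As a rough sketch of how these pieces fit together on the read path
(names and signatures are hypothetical, not the actual API): the page
service fetches a base image and the WAL records that follow it from
the repository, and asks the WAL redo service to materialize the
requested page:

type Lsn = u64;
type Page = Vec<u8>; // an 8 KiB block in practice

struct WalRecord(Vec<u8>);

// What the repository hands back for a page request: the most recent
// base image at or before the LSN, plus the WAL records after it.
struct PageReconstructData {
    base_image: Option<Page>,
    records: Vec<WalRecord>,
}

// Hypothetical core of the page service: look up reconstruct data in
// the repository, then let the WAL redo service replay the records.
fn get_page_at_lsn(
    repo: impl Fn(Lsn) -> PageReconstructData,         // stand-in for the repository
    redo: impl Fn(Option<Page>, &[WalRecord]) -> Page, // stand-in for WAL redo
    lsn: Lsn,
) -> Page {
    let data = repo(lsn);
    redo(data.base_image, &data.records)
}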

Running local installation

  1. Install build dependencies and other useful packages

On Ubuntu or Debian this set of packages should be sufficient to build the code:

apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev

Rust 1.56.1 or later is also required.

To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include tmp_install/bin and tmp_install/lib, respectively.

To run the integration tests or Python scripts (not required to use the code), install Python (3.7 or higher) and the required python3 packages by running ./scripts/pysync (requires poetry) in the project directory.

  2. Build zenith and patched postgres
git clone --recursive https://github.com/zenithdb/zenith.git
cd zenith
make -j5
  3. Start pageserver and postgres on top of it (should be run from the repo root):
# Create repository in .zenith with proper paths to binaries and data
# Later that would be responsibility of a package install script
> ./target/debug/zenith init
initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229
created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8
created main branch
pageserver init succeeded

# start pageserver and safekeeper
> ./target/debug/zenith start
Starting pageserver at 'localhost:64000' in '.zenith'
Pageserver started
initializing for single for 7676
Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/single'
Safekeeper started

# start postgres compute node
> ./target/debug/zenith pg start main
Starting new postgres main on timeline 5b014a9e41b4b63ce1a1febc04503636 ...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres'
waiting for server to start.... done
server started

# check list of running postgres instances
> ./target/debug/zenith pg list
NODE	ADDRESS	TIMELINES	BRANCH NAME	LSN		STATUS
main	127.0.0.1:55432	5b014a9e41b4b63ce1a1febc04503636	main	0/1609610	running
  4. Now it is possible to connect to postgres and run some queries:
> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)
  5. And create branches and run postgres on them:
# create branch named migration_check
> ./target/debug/zenith timeline branch --branch-name migration_check
Created timeline '0e9331cad6efbafe6a88dd73ae21a5c9' at Lsn 0/16F5830 for tenant: c03ba6b7ad4c5e9cf556f059ade44229. Ancestor timeline: 'main'

# check branches tree
> ./target/debug/zenith timeline list
 main [5b014a9e41b4b63ce1a1febc04503636]
 ┗━ @0/1609610: migration_check [0e9331cad6efbafe6a88dd73ae21a5c9]

# start postgres on that branch
> ./target/debug/zenith pg start migration_check
Starting postgres node at 'host=127.0.0.1 port=55433 user=stas'
waiting for server to start.... done

# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p55433 -h 127.0.0.1 -U zenith_admin postgres
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=# insert into t values(2,2);
INSERT 0 1
  6. If you want to run tests afterwards (see below), you have to stop the pageserver, safekeeper, and postgres instances you have just started. You can stop them all with one command:
> ./target/debug/zenith stop

Running tests

git clone --recursive https://github.com/zenithdb/zenith.git
make # also builds postgres and installs it to ./tmp_install
./scripts/pytest

Documentation

We currently use README files to cover design ideas and the overall architecture of each module, along with rustdoc-style documentation comments. See also /docs/ for a top-level overview of all available markdown documentation.

To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open

Postgres-specific terms

Due to Zenith's very close relation with PostgreSQL internals, numerous PostgreSQL-specific terms are used. The same applies to certain spelling: for example, we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it would be inconsistent with what the PostgreSQL code and its documentation use.

To get more familiar with this aspect, refer to:

Join the development
