The caller is now responsible for looking up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.
There was a bug in the old approach, as Konstantin noted on Discord:
Assume that freeze doesn't create new inmem layer
(maybe_new_open=None). Then we temporary place in historics frozen
layer. Assume that now new put_wal_record request arrives. There is
no open in-mem layer, so it has to create new one. It is looking for
previous layer for read and set it as new in-mem layer
predecessor. But as far as I understand, prev layer should be our
temporary frozen layer. Which will be then removed from
historics.
That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.
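To illustrate the caller-side lookup described above, here is a minimal sketch. The `Layer` and `LayerMap` types and their methods are hypothetical simplifications, not the actual Zenith types. The point is that a layer holds no reference to its predecessor, so removing a layer from the map leaves nothing dangling.

```rust
use std::collections::BTreeMap;

/// Log sequence number; a plain u64 in this sketch.
type Lsn = u64;

/// Hypothetical, simplified layer. Note that it has no `predecessor` field.
#[derive(Debug)]
struct Layer {
    start_lsn: Lsn,
    /// Page versions stored in this layer, keyed by LSN.
    page_versions: BTreeMap<Lsn, String>,
}

impl Layer {
    fn new(start_lsn: Lsn) -> Self {
        Layer { start_lsn, page_versions: BTreeMap::new() }
    }

    fn put(&mut self, lsn: Lsn, value: &str) {
        self.page_versions.insert(lsn, value.to_string());
    }
}

/// Hypothetical layer map for one segment, keyed by each layer's start LSN.
#[derive(Default)]
struct LayerMap {
    layers: BTreeMap<Lsn, Layer>,
}

impl LayerMap {
    fn insert(&mut self, layer: Layer) {
        self.layers.insert(layer.start_lsn, layer);
    }

    fn remove(&mut self, start_lsn: Lsn) -> Option<Layer> {
        self.layers.remove(&start_lsn)
    }

    /// Latest layer whose start LSN is <= `lsn`.
    fn layer_at(&self, lsn: Lsn) -> Option<&Layer> {
        self.layers.range(..=lsn).next_back().map(|(_, layer)| layer)
    }

    /// The caller walks the map from newer to older layers;
    /// the layers themselves never point at each other.
    fn get_page_version(&self, lsn: Lsn) -> Option<&String> {
        let mut upper = lsn;
        loop {
            let layer = self.layer_at(upper)?;
            if let Some((_, value)) = layer.page_versions.range(..=lsn).next_back() {
                return Some(value);
            }
            // Not found here: continue with whatever layer precedes this one.
            upper = layer.start_lsn.checked_sub(1)?;
        }
    }
}
```

Because the predecessor is found through the map on every lookup, a layer that has been removed from the map is simply no longer reachable and can be dropped.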
This makes two subtle changes:
1. When the first new layer is created on a branch for a segment that
existed on the ancestor branch, the start_lsn of the new layer is now
the branch point + 1. We were previously slightly confused about what
the branch point LSN meant. It means that all the WAL up to and
*including* the LSN on the old branch is visible to the new branch.
If we mark the start LSN of the new layer as equal to the branch point,
that's wrong, because if there is a WAL record with that LSN on the
predecessor layer, the new layer would hide it. This bug was hidden
when the layer on the new branch contained a direct reference to the
layer in the old branch, as get_page_reconstruct_data() followed that
reference directly when it didn't find the page version in the new
layer. But now that the caller performs the lookup, it will look up
the new layer that doesn't contain the record, and you get an error.
2. InMemoryLayer now always stores the segment size at the beginning
of the layer's LSN range. Previously, get_seg_size() might have
recursed into the predecessor layer to get the size, but now we
avoid that by always copying over the last size from the previous
layer, when a new layer is created.
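The two changes above can be sketched roughly as follows. This is an illustration under assumptions, not the real `InMemoryLayer`: the field names, `create_on_branch`, `create_successor`, `record_size` and `covers` are all hypothetical, and sizes are plain block counts.

```rust
/// Log sequence number; a plain u64 in this sketch.
type Lsn = u64;

/// Hypothetical in-memory layer for one segment.
#[derive(Debug)]
struct InMemoryLayer {
    /// First LSN covered by this layer (inclusive).
    start_lsn: Lsn,
    /// Segment size as of `start_lsn`, copied from the previous layer.
    start_size: u32,
    /// Size changes recorded in this layer as (lsn, size), in LSN order.
    size_changes: Vec<(Lsn, u32)>,
}

impl InMemoryLayer {
    /// First layer on a child branch. WAL up to and *including*
    /// `branch_lsn` belongs to the ancestor, so this layer starts one past it.
    fn create_on_branch(branch_lsn: Lsn, ancestor_size: u32) -> Self {
        InMemoryLayer {
            start_lsn: branch_lsn + 1,
            start_size: ancestor_size,
            size_changes: Vec::new(),
        }
    }

    /// Successor of `prev` on the same branch: copy over its last known size.
    fn create_successor(prev: &InMemoryLayer, start_lsn: Lsn) -> Self {
        let last_size = prev
            .size_changes
            .last()
            .map(|&(_, size)| size)
            .unwrap_or(prev.start_size);
        InMemoryLayer { start_lsn, start_size: last_size, size_changes: Vec::new() }
    }

    /// Record a size change; callers are assumed to pass increasing LSNs.
    fn record_size(&mut self, lsn: Lsn, size: u32) {
        self.size_changes.push((lsn, size));
    }

    /// Whether this layer is responsible for `lsn`.
    fn covers(&self, lsn: Lsn) -> bool {
        lsn >= self.start_lsn
    }

    /// Size as of `lsn`, answered locally without consulting a predecessor.
    fn get_seg_size(&self, lsn: Lsn) -> u32 {
        self.size_changes
            .iter()
            .rev()
            .find(|(change_lsn, _)| *change_lsn <= lsn)
            .map(|(_, size)| *size)
            .unwrap_or(self.start_size)
    }
}
```

With a branch point of 100, the child layer starts at 101, so a WAL record at LSN 100 is still looked up on the ancestor rather than being hidden by the new layer.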
Zenith
Zenith replaces the PostgreSQL storage layer and redistributes data across a cluster of nodes.
Architecture overview
A Zenith installation consists of compute nodes and a storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage.
The Zenith storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
- WAL service. The service that receives WAL from the compute node and ensures that it is stored durably.
The Pageserver consists of:
- Repository - Zenith storage implementation.
- WAL receiver - service that receives WAL from the WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records at the Page service's request.
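As a rough illustration of "building a page from a base image and WAL records", here is a toy sketch. It is a stand-in only and does not reflect how Zenith's WAL redo is implemented: the `WalRecord` type, its byte-overwrite semantics, and the `redo` function are all assumptions made for the example.

```rust
/// Hypothetical WAL record: overwrite `data` at `offset` within the page.
struct WalRecord {
    lsn: u64,
    offset: usize,
    data: Vec<u8>,
}

/// Rebuild a page as of `lsn` by replaying records onto a base image.
/// `base_lsn` is the LSN at which the base image was taken, and
/// `records` are assumed to be sorted by LSN and to fit within the page.
fn redo(base: &[u8], base_lsn: u64, records: &[WalRecord], lsn: u64) -> Vec<u8> {
    let mut page = base.to_vec();
    // Apply only records newer than the base image and not newer than `lsn`.
    for rec in records.iter().filter(|r| r.lsn > base_lsn && r.lsn <= lsn) {
        page[rec.offset..rec.offset + rec.data.len()].copy_from_slice(&rec.data);
    }
    page
}
```

Requesting the page at different LSNs replays a different prefix of the records, which is what lets the same stored data serve reads at any point in the WAL history.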
Running local installation
- Install build dependencies and other useful packages
On Ubuntu or Debian this set of packages should be sufficient to build the code:
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev
Rust 1.52 or later is also required.
To run the psql client, install the postgresql-client package or modify PATH and LD_LIBRARY_PATH to include tmp_install/bin and tmp_install/lib, respectively.
To run the integration tests (not required to use the code), install
Python (3.6 or higher), then install the Python packages by running pipenv install in the project directory.
- Build zenith and patched postgres
git clone --recursive https://github.com/zenithdb/zenith.git
cd zenith
make -j5
- Start pageserver and postgres on top of it (should be called from repo root):
# Create repository in .zenith with proper paths to binaries and data
# Later this will be the responsibility of a package install script
> ./target/debug/zenith init
pageserver init succeeded
# start pageserver
> ./target/debug/zenith start
Starting pageserver at '127.0.0.1:64000' in .zenith
Pageserver started
# start postgres on top of the pageserver
> ./target/debug/zenith pg start main
Starting postgres node at 'host=127.0.0.1 port=55432 user=stas'
waiting for server to start.... done
# check list of running postgres instances
> ./target/debug/zenith pg list
BRANCH ADDRESS LSN STATUS
main 127.0.0.1:55432 0/1609610 running
- Now it is possible to connect to postgres and run some queries:
> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
INSERT 0 1
postgres=# select * from t;
key | value
-----+-------
1 | 1
(1 row)
- And create branches and run postgres on them:
# create branch named migration_check
> ./target/debug/zenith branch migration_check main
Created branch 'migration_check' at 0/1609610
# check branches tree
> ./target/debug/zenith branch
main
┗━ @0/1609610: migration_check
# start postgres on that branch
> ./target/debug/zenith pg start migration_check
Starting postgres node at 'host=127.0.0.1 port=55433 user=stas'
waiting for server to start.... done
# this new postgres instance will have all the data from 'main' postgres,
# but modifications made here will not affect data in the original postgres
> psql -p55433 -h 127.0.0.1 -U zenith_admin postgres
postgres=# select * from t;
key | value
-----+-------
1 | 1
(1 row)
postgres=# insert into t values(2,2);
INSERT 0 1
- If you want to run tests afterwards (see below), you have to stop pageserver and all postgres instances you have just started:
> ./target/debug/zenith pg stop migration_check
> ./target/debug/zenith pg stop main
> ./target/debug/zenith stop
Running tests
git clone --recursive https://github.com/zenithdb/zenith.git
cd zenith
make # also builds postgres and installs it to ./tmp_install
cd test_runner
pytest
Documentation
Currently we use README files to cover the design ideas and overall architecture of each module, along with rustdoc-style documentation comments. See also /docs/ for a top-level overview of all available markdown documentation.
- /docs/sourcetree.md contains an overview of the source tree layout.
To view your rustdoc documentation in a browser, try running cargo doc --no-deps --open
Postgres-specific terms
Because Zenith is so closely tied to PostgreSQL internals, it uses numerous PostgreSQL-specific terms. The same applies to certain spellings: for example, we use MB to denote 1024 * 1024 bytes. MiB would be technically more correct, but it is inconsistent with what the PostgreSQL code and its documentation use.
To get more familiar with this aspect, refer to:
- Zenith glossary
- PostgreSQL glossary
- Other PostgreSQL documentation and sources (Zenith fork sources can be found here)
Join the development
- Read CONTRIBUTING.md to learn about project code style and practices.
- To get familiar with the source tree layout, use /docs/sourcetree.md.
- To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html