Fixes#1628
- add [`comfy_table`](https://github.com/Nukesor/comfy-table/tree/main) and use it to construct table for `pg list` CLI command
Comparison
- Old:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
- New:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
Make it remember when timeline starts in general and on this safekeeper in
particular (the point might be later on new safekeeper replacing failed one).
Bumps control file and walproposer protocol versions.
While protocol is bumped, also add safekeeper node id to
AcceptorProposerGreeting.
ref #1561
Now princeple is following: acceptor threads (libpq and http) error will
bring the pageserver down, but all per-tenant thread failures will be treated
as an error.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.
An extra timestamp field is now stored in an extra field after each
Clog page. It is the timestamp of the latest commit, among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.
To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.
Implements github issue 1361.
* Traverse frozen layer in get_reconstruct_data in reverse order
* Fix comments on frozen layers.
Note explicitly the order that the layers are in the queue.
* Add fail point to reproduce failpoint iteration error
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Now proxy binary accepts `--auth-backend` CLI option, which determines
auth scheme and cluster routing method. Following backends are currently
implemented:
* legacy
old method, when username ends with `@zenith` it uses md5 auth dbname as
the cluster name; otherwise, it sends a login link and waits for the console
to call back
* console
new SCRAM-based console API; uses SNI info to select the destination
cluster
* postgres
uses postgres to select auth secrets of existing roles. Useful for local
testing
* link
sends login link for all usernames
* `cloud::legacy` talks to Cloud API V1.
* `cloud::api` defines Cloud API v2.
* `cloud::local` mocks the Cloud API V2 using a local postgres instance.
* It's possible to choose between API versions using the `--api-version` flag.
This is needed to forward the `ClientKey` that's required
to connect the proxy to a compute.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.
Rename libs/postgres_ffi/README and safekeeper/README files to also
have the the .md extension, so that github can render them nicely.
Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
When failpoint feature is disabled it throws away passed code so code
inside is not guaranteed to compile when feature is disabled. In this
particular case code is obsolete so removing it.
Remove when it is consumed by all of 1) pageserver (remote_consistent_lsn) 2)
safekeeper peers 3) s3 WAL offloading.
In test s3 offloading for now is mocked by directly bumping s3_wal_lsn.
ref #1403
Previously we've used table interface, but there was no easy way to pass
it as an override to pageserver through cli. Use the same strategy as
for remote storage config parsing
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenantspecific config parameters
Refactoring: move tenant_config code to a separate module.
Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart.
Ignore error during tenant config loading, while it is not supported by console
Define PiTR interval for GC.
refer #1320
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.
The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.
Add a performance test that runs the perf_pgbench test, with profiling
enabled.
We only checked the cache page version when collecting WAL records in
an in-memory layer, not in a delta layer. Refactor the code so that we
always stop collecting WAL records when we reach a cached materialized
page.
Fix the assertion on the LSN range in
InMemoryLayer::get_value_reconstruct_data. It was supposed to check
that the requested LSN range is within the layer's LSN range, but the
inequality was backwards. That went unnoticed before, because the
caller always passed the layer's start LSN as the requested LSN
range's start LSN, but now we might stop the search earlier, if we have
a cached page version.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Unlink failure isn't serious on its own, we were about to remove the
file anyway, but it shouldn't happen and could be a symptom of
something more serious.
We just saw "No such file or directory" errors happening from
ephemeral file writeback in staging, and I suspect if we had this
warning in place, we would have seen these warnings too, if the
problem was that the ephemeral file was removed before dropping the
EphemeralFile struct. Next time it happens, we'll have more
information.
With this, we no longer need to build two versions of 'pem' and 'base64'
crates. Introduces a duplicate version of 'time' crate, though, but it's
still progress.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
works, without actually recording any performance numbers, but that
doesn't seem very interesting anymore. Remove it to avoid confusing it
with the actual pgbench benchmarks
- Run pgbench with "-n" and "-S" options, for two different workloads:
simple-updates, and SELECT-only. Previously, we would only run it with
the "default" TPCB-like workload. That's more or less the same as the
simple-update (-n) workload, but I think the simple-upload workload
is more relevant for testing storage performance. The SELECT-only
workload is a new thing to measure.
- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
a new "remote" implementation of the PgCompare class, which allows
running the same tests against an already-running Postgres instance.
- Make the PgBenchRunResult.parse_from_output function more
flexible. pgbench can print different lines depending on the
command-line options, but the parsing function expected a particular
set of lines.
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability override them in the connect() call.
It was the only test that used the 'schema' argument to the connect()
function. I'm about to refactor the option handling and will remove
the special 'schema' argument altogether, so rewrite the test to not
use it.
There's a lot of renaming left to do in the code and docs, but this is
a start. Our binaries and many other things are still called "zenith",
but I didn't change those in the README, because otherwise the
examples won't work. I added a brief note at the top of the README to
explain that we're in the process of renaming, until we've renamed
everything.
It originated from the fact that we were calling to fetch_full_index
without releasing the read guard, and fetch_full_index tries to acquire
read again. For plain mutex it is already a deeadlock, for RW lock
deadlock was achieved by an attempt to acquire write access later in the
code while still having active read guard up in the stack
This is sort of a bandaid because Kirill plans to change this code
during removal of an archiving mechanism
* [proxy] Add SCRAM auth
* [proxy] Implement some tests for SCRAM
* Refactoring + test fixes
* Hide SCRAM mechanism behind `#[cfg(test)]`
Currently we only use it in tests, so we hide all relevant
module behind `#[cfg(test)]` to prevent "unused item" warnings.
* Add test for restore from WAL
* Fix python formatting
* Choose unused port in wal restore test
* Move recovery tests to zenith_utils/scripts
* Set LD_LIBRARY_PATH in wal recovery scripts
* Fix python test formatting
* Fix mypy warning
* Bump postgres version
* Bump postgres version
We now use a page cache for those, instead of slurping the whole index into
memory.
Fixes https://github.com/zenithdb/zenith/issues/1356
This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
This introduces two new abstraction layers for I/O:
- Block I/O, and
- Blob I/O.
The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.
There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.
The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.
Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.
This is a backwards-incompatible change to the storage format.
We have these methods for some time in the API, so mentioning them in the
spec could be useful for console (see zenithdb/console#867), as we generate
pageserver HTTP API golang client there.
It happened in unit tests. If a thread tries to read a buffer while
already holding a lock on one buffer, the code to find a victim buffer
to evict could try to evict the buffer that's already locked. To fix,
skip locked buffers.
* Add a test case for reading historic page versions
Test read_page_at_lsn returns correct results when compared to page inspect.
Validate possiblity of reading pages from dropped relation.
Ensure funcitons read latest version when null lsn supplied.
Check that functions do not poison buffer cache with stale page versions.
Safekeers now publish to and pull from etcd per-timeline data. Immediate goal is
WAL truncation, for which every safekeeper must know remote_consistent_lsn; the
next would be callmemaybe replacement.
Adds corresponding '--broker' argument to safekeeper and ability to run etcd in
tests.
Adds test checking remote_consistent_lsn is indeed communicated.
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features being enabled. Problem occurs when you build crates separately
one by one. So this is irrelevant to our CI setup because there we build
all binaries at once, but it may be relevant for local development.
this also changes cargo's resolver version to 2
Follow-up for #1417. Previously we had a problem uploading to S3
due to huge ammount of existing not yet uploaded data. Now we have a
fresh pageserver with LSM storage on staging, so we can try enabling it
once again.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which
relations exist and what are their sizes, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a
simple key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible from handling branching, and every GET operation comes
with an LSN.
Mapping from Postgres data directory to keys/values
---------------------------------------------------
All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.
The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.
The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
Changes to low-level layer file code
------------------------------------
The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.
Checkpointing, compaction
-------------------------
The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
Deployment
----------
This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, at the same time
with a new pageserver with the new binary.
Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
More rows, and test with serial and parallel plans. But fewer iterations,
so that the tests run in < 1 minutes, and we don't need to mark them as
"slow".
With a Mutex, only one thread could read from the layer at a time. I did
some ad hoc profiling with pgbench and saw that a fair amout of time was
spent blocked on these Mutexes.
It doesn't make much sense to compare TimelineMetadata structs with
< or >. But we depended on that in the remote storage upload code,
so replace BTreeSets with Vecs there.
* [proxy] Propagate most errors to user
This change enables propagation of most errors to the user
(e.g. auth and connectivity errors). Some of them will be
stripped of sensitive information.
As a side effect, most occurrences of `anyhow::Error` were
replaced with concrete error types.
* [proxy] Box weighty errors
Have separate routine and http endpoint to create timeline on safekeepers. It is
not used yet, i.e. timeline is still created implicitly, but we'll change that
once infrastructure for learning which tlis are assigned to which safekeepers
will be ready, preventing accidental creation by compute.
Changes format of safekeeper control file, allowing to store set of
peers. Knowing peers provides a part of foundation for peer
recovery (calculating min horizons like truncate_lsn for WAL truncation and
commit_lsn for sync-safekeepers replacement) and proper membership change;
similarly, we don't yet use it for now.
Employing cf file version bump, extracts tenant_id and timeline_id to top level
where it is more suitable. Also adds a bunch of LSNs there and rename
truncate_lsn to more specific peer_horizon_lsn.
* Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
* add node id to pageserver
This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for envrionment.
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
* new deployment flow for staging and production
* ansible playbooks and circleci config fixes
* cleanup before merge
* additional cleanup before merge
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* bianries artifacts path fix for ansible playbooks
* deployment flow refactored
* base64 decode fix for ssh key
* fix for console notification and production deploy settings
* cleanup after deployment tests
* fix - trigger release binaries download for production deploy
When several AppendRequest's can be read from socket without blocking,
they are processed together and fsync() to segment file is only called
once. Segment file is no longer opened for every write request, now
last opened file is cached inside PhysicalStorage. New metric for WAL
flushes was added to the storage, FLUSH_WAL_SECONDS. More errors were
added to storage for non-sequential WAL writes, now write_lsn can be
moved only with calls to truncate_lsn(new_lsn).
New messages have been added to ProposerAcceptorMessage enum. They
can't be deserialized directly and now are used only for optimizing
flushes. Existing protocol wasn't changed and flush will be called for
every AppendRequest, as it was before.
Since commit fdd987c3ad, it was only used in InMemoryLayers. Let's
just "inline" the code into InMemoryLayer itself.
I originally did this as part of a bigger PR (#1267). With that PR,
one in-memory layer, and one ephemeral file, would hold page versions
belonging to multiple segments. Currently, PageVersions can only hold
versions for a single segment, so that would need to be changed.
Rather than modify PageVersions to support that, just remove it
altogether.
These tests have intimate knowledge of the directory layeout and layer
file names used by the LayeredRepository implementation of the
Repository trait. Move them, so that all the tests that remain in
repository.rs are expected to work without changes with any
implementation of Repository. Not that we have any plans to create
another Repository implementaiton any time soon, but as long as we
have the Repository interface, let's try to maintain that abstraction
in the tests too.
The test creates a page version with a string like "foo 123 at 0/10"
as the content. But the LSN stored in that string was wrong: the page
version stored at LSN 0/20 would say "foo <blk> at 0/10".
wal_storage.rs was split up from timeline.rs, safekeeper.rs and send_wal.rs,
and now contains all WAL related code from the safekeeper. Now there are
PhysicalStorage for persisting WAL to disk and WalReader for reading it.
This allows optimizing PhysicalStorage without affecting too much of other
code.
Also there is a separate structure for persisting control file now in
control_file.rs.
Previous version of spec caused parsing errors in generated clients
as return type is object not array, also one field was missing. In
a passing set `format: hex` on ancestor_id too as value conforms to
that format.
This change makes most parts of the code asynchronous, except
for the `mgmt` subsystem (we're going to drop it anyway).
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
If a heap UPDATE record modified two pages, and both pages needed to have
their VM bits cleared, and the VM bits were located on the same VM page,
we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for
the same VM page. That produced warnings like this in the pageserver log:
Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists
To fix, change ClearVisibilityMapFlags so that it can update the bits
for both pages as one operation.
This was already covered by several python tests, so no need to add a
new one. Fixes#1125.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
It was printing a lot of stuff to the log with INFO level, for routine
things like receiving or sending messages. Reduce the noise. The amount
of logging was excessive, and it was also consuming a fair amount of CPU
(about 20% of safekeeper's CPU usage in a little test I ran).
* Always initialize flush_lsn/commit_lsn metrics on a specific timeline, no more `n/a`
* Update flush_lsn metrics missing from cba4da3f4d
* Ensure that flush_lsn found on load is >= than both commit_lsn and truncate_lsn
* Add some debug logging
Use GUC zenith.max_cluster_size to set the limit.
If limit is reached, extend requests will throw out-of-space error.
When current size is too close to the limit - throw a warning.
Add new test: test_timeline_size_quota.
Use log::error!() instead. I spotted a few of these "connection error"
lines in the logs, without timestamps and the other stuff we print for
all other log messages.
Timeline is active whenever there is at least 1 connection from compute or
pageserver is not caught up. Currently 'active' means callmemaybes are being
sent.
Fixes race: now suspend condition checking and callmemaybe unsubscribe happen
under the same lock.
Now it's possible to call Fe{Startup,}Message in both
sync and async contexts, which is good for proxy.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
* Fix checkpoint.nextXid update
* Add test for cehckpoint.nextXid
* Fix indentation of test_next_xid.py
* Fix mypy error in test_next_xid.py
* Tidy up the test case.
* Add a unit test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Freeze vectors at the same end LSN
* Fix calculation of last LSN for inmem layer
* Do not advance disk_consistent_lsn is no open layer was evicted
* Fix calculation of freeze_end_lsn
* Let start_lsn be larger than oldest_pending_lsn
* Rename 'oldest_pending_lsn' and 'last_lsn', add comments.
* Fix future_layerfiles test
* Update comments conserning olest_lsn
* Update comments conserning olest_lsn
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Now zenith_cli handles wal_acceptors config internally, and if we
will append wal_acceptors to postgresql.conf in python tests, then
it will contain duplicate wal_acceptors config.
Currently ztimelineids are unique, but all APIs accept the pair, so let's keep
it everywhere for uniformity.
Carry around ZTTId containing both ZTenantId and ZTimelineId for simplicity.
(existing clusters on staging ought to be preprocessed for that)
* Reproduce github issue #1047.
* Use RwLock to protect gc_cuttof_lsn
* Eeduce number of updates in test_gc_aggressive
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages
* Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC
* Remove random sleep between wait_for_lsn and get_page_at_lsn
* Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn
* Update test_prohibit_branch_creation_on_pre_initdb_lsn test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
This patch includes attach/detach http endpoints in pageservers. Some
changes in callmemaybe handling inside safekeeper and an integrational
test to check migration with and without load. There are still some
rough edges that will be addressed in follow up patches
Mainly because it has better support for installing the packages from
different python versions.
It also has better dependency resolver than Pipenv. And supports modern
standard for python dependency management. This includes usage of
pyproject.toml for project specific configuration instead of per
tool conf files. See following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/https://www.python.org/dev/peps/pep-0518/
* Do not delete layers beyand cutoff LSN
* Update pageserver/src/layered_repository/layer_map.rs
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036, previously the libpq
listener thread would not exit until one more connection arrives.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHanlde of each new thread to a vector but old handles
for threads that had already exited were never removed.
Log the error and continue. Hopefully it's a transient failure.
This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
'anyhow' crate can include a backtrace in all errors, when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.
The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.
In the passing, reorder some function arguments to be more consistent.
The walkeeper launch two threads for each connection, and uses a guard
object to remove entry from 'replicas' array, when finishes. But only
the background thread held onto the guard object, so if the background
thread finished before the other thread, the array entry would be
removed prematurely, which lead to panic in the check_stop_streaming()
call.
Fixes https://github.com/zenithdb/zenith/issues/1103
Hexalize zids there for better output; since Serde doesn't support several
formats for one struct, on-disk representation is changed as well, make
upgrade.rs cope with it.
to avoid a subtle race condition.
Without safekeeper, walreceiver reconnection can stuck,
because of IO deadlock between walsender auth and regular backend.
* Do not hold timelines lock during GC
refer #1087
* Add gc_cs mutex for preveting creation of new timelines during GC
* Make clippy happy
* Use Mutex<()> instead of Mutex<i32> for GC critical section
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.
Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.
Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
The first COPY generates about 230 MB of write I/O, but the second
COPY, after deleting most of the rows and vacuuming the rows away,
generates 370 MB of writes. Both COPYs insert the same amount of data,
so they should generate roughly the same amount of I/O. This commit
doesn't try to fix the issue, just adds a test case to demonstrate it.
Add a new 'checkpoint' command to the pageserver API. Previously,
we've used 'do_gc' for that, but many tests, including this new one,
really only want to perform a checkpoint and don't care about GC. For
now, I only used the command in the new test, though, and didn't
convert any existing tests to use it.
It creates busy loop if pageserver <-> safekeeper connection fails after it was
established (e.g. currently due to 'segment checkpoint not found' error on
pageserver).
Also wake up callmemaybe thread regularly once in recall_period regardless of
channel activity.
This patch allows to shutdown wal receiver when there are no messages
and wal receiver is blocked inside tokio-postgres. In this case it
cannot check the shutdown flag.
This patch switches to use async interface of tokio-postgres directly
without sync wrappers. It opens the possibility to use tokio::select!
between the phsycal_stream.next() and a shutdown channel readiness to
interrupt replication process.
Also this allows to shutdown only particular wal receiver without
using global shutdown_requested flag.
Currently it's included with minimal changes and lives aside of the main
workspace. Later we may re-use and combine common parts with zenith
control_plane.
This change is mostly needed to unify cloud deployment pipeline:
1.1. build compute-tools image
1.2. build compute-node image based on the freshly built compute-tools
2. build zenith image
So we can roll new compute image and new storage required by it to
operate properly. Also it becomes easier to test console against some
specific version of compute-node/-tools.
If safekeepers sync fast enough, callmemaybe thread may never make a call before receiving Unsubscribe request. This leads to the situation, when pageserver lacks data that exists on safekeepers.
Do it separately with SafekeeperPostgresCommand enum as a result. Since query is
always C string, switch postgres_backend process_query argument from Bytes to
&str.
Make passing ztli/ztenant id in safekeeper connection string optional; this is
needed for upcoming intra-safekeeper heartbeat cmd which is not bound to any
timeline.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration respectively.
Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
Introduce builder objects, DeltaLayerWriter and ImageLayerWriter.
This gives more flexibility, as the DeltaLayer::create and
ImageLayer::create functions don't need to know about the details of
the format of where the page versions are coming from. This allows us
to change the format used in InMemoryLayer more easily, without having
to modify Delta- and ImageLayer code.
Also refactor the code in InMemoryLayer::write_to_disk for clarity.
Previously, the 'blknum' argument of various Layer functions was the
block number within the overall relation. That was pretty confusing,
because an individual layer only holds data from a one segment of the
relation. Furthermore, the 'put_truncation' function already dealt
with per-segment size, not overall relation size, adding to the
confusion.
Change the meaning of the 'blknum' argument to mean the block number
within the segment, not the overall relation.
If a commit record contains XIDs that are stored on different CLOG pages,
we duplicate the commit record for each affected CLOG page. In the redo
routine, we must only apply the parts of the record that apply to the
CLOG page being restored. We got that right in the loop that handles the
sub-XIDs, but incorrectly always set the bit that corresponds to the main
XID.
The logic to compute the page number was broken, and as a result, only
the first page of multixact members was updated correctly. All the
rest were left as zeros. Improve test_multixact.py to generate more
multixacts, to cover this case.
Also fix the check that the restored PG data directory matches the
original one. Previously, the test compared the 'pg_new' cluster,
which is a bit silly because the test restored the 'pg_new' cluster
only a few lines earlier, so if the multixact WAL redo is somehow
broken, the comparison will just compare two broken data directories
and report success. Change it to compare the original datadir, the one
where the multixacts were originally created, with a restored image of
the same.
WAL stream uses the 2 connections:
1. Compute node (walproposer) -> Safekeeper (ReceiveWalConn module)
When compute node is shut down, safekeeper needs to stop the respective receiving thread.
Prior to this PR it didn't work because PostgresBackend haven't handled disconnection properly.
2. Safekeeper (ReplicationConn module) -> pageserver (walreceiver thread)
When incoming WAL stream is gone, safekeeper can stop streaming WAL and cancel connection as soon as replica is caught up.
Note that the WAL can be streamed to multiple replicas simultaneously, only disconnect ones that are caught up to the last_recieved_lsn.
The ephemeral files are not usable after restart, so just delete them.
Before this, you got "unrecognized filename in timeline dir" warnings
about them, as Konstantin noted at:
https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870.
While we're at it, refactor away the list_files() function, moving the
logic fully into the caller. Seems more straightforward.
Rename save_decoded_record() to ingest_record(), and move the
responsibility for decoding the record into ingest_record().
Also move the responsibility of updating the CheckPoint relish to
ingest_record(). Put it in a new WalIngest struct, to help with tracking
that.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.
This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.
(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
0.28.0 includes two changes I submitted to upstream:
- Add support for older ListObjects API, needed to use rust-s3 with Google
Cloud Storage: https://github.com/durch/rust-s3/pull/229
- If file is smaller than one chunk, don't initiate multi-part upload.
https://github.com/durch/rust-s3/pull/228
These are not critical for Zenith right now, but let's stay up-to-date.
Currently, we return an all-zeros page if you request a block beyond end of
a relation. That has been implemented in LayeredTimeline::materialize_page,
so that if Layer::get_page_reconstruct_data returns Missing, it returns
and all-zeros page.
However InMemoryLayer and DeltaLayer would return Continue, not Missing,
in that case, and materialize_page would try to find the predecessor
layer. If there was a preceding image layer, then everything would still
work, but if there wasn't, it would return a "could not find predecessor
of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them
check the size of the relation and return Missing in that case.
This is hard to reproduce at the moment, but it happened quickly with
pgbench when I modified InMemoryLayer::write_to_disk so that it didn't
always create a new ImageLayer.
- Don't spawn a separate thread for each connection.
Instead use one thread per safekeeper, that iterates over all connections and sends callback requests for them.
-Use tokio postgres to connect to the pageserver, to avoid spawning a new thread for each connection.
callmemaybe review fixes:
- Spawn all request_callback tasks separately.
- Remember 'last_call_time' and only send request_callback if 'recall_period' has passed.
- If task hasn't finished till next recall, abort it and try again.
- Add pause/resume CallmeEvents to avoid spamming pageserver when connection already established.
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
Should have been a part of cba4da3f4d to provide upgrade for previously
existing clusters. Separates version independent header (magic + version) out of
SafeKeeperState to choose what to deserialize.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).
Full history of the proposer is transferred in separate message before proposer
starts streaming; it is immediately persisted by safekeeper, though he might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.
We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.
Per https://github.com/zenithdb/rfcs/pull/12Closes#296.
Bumps vendor/postgres.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it it. Alternatively, perhaps we should close it
on unload() to save some memory, but this I'm not going to think too
hard about it right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.
This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
This doesnt show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.
To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.
This includes changes from 'inmemory-layer-chunks' branch to serialize /
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.
To get there there was also two cleanups:
* Fix error handling in Execute message handler. This fixes behaviour
when basebackup retured an error. Previously pageserver thread just
died.
* Remove "ancestor" file which previously contained ancestor id and
branch lsn. Currently the same data can be obtained from metadata file.
And just the way we handled ancestor file in the code introduced the
case when branching fails timeline directory is created but there is no data in it
except ancestor file. And this confused gc because it scans
directories. So it is better to just remove ancestor file and clean up
this timeline directory creation so it happens after all validity
checks have passed
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.
This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
Also this adds a check to get_page_at_lsn which asserts that lsn at
which the page is requested was not garbage collected. This check
currently is triggered for readonly nodes which are pinned to specific
lsn and because they are not tracked in pageserver garbage collection
can remove data that still might be referenced. This is a bug and will
be fixed separately.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.
One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.
The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.
Co-authored-by: Patrick Insinger <patrick@zenith.tech>
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.
It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where cross segment boundary.
Fixes https://github.com/zenithdb/zenith/issues/841
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:
1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.
2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.
Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
Now safekeeper control file updated in a following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).
Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).
Also because of the rename machinery switch to use dedicated lock file
to prevent running several safekeepers concurrently on the same data
cleanup
fsync control file after rename to match postgres behaviour
* change zenith-perf-data checkout ref to be main
* set cluster id through secrets so there is no code changes required
when we wipe out clusters on staging
* display full pgbench output on error
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.
Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
tests are based on self-hosted runner which is physically close
to our staging deployment in aws, currently tests consist of
various configurations of pgbenchi runs.
Also these changes rework benchmark fixture by removing globals and
allowing to collect reports with desired metrics and dump them to json
for further analysis. This is also applicable to usual performance tests
which use local zenith binaries.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.
After storing a WAL record in the repository, the WAL receiver checks
if the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.
This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.
Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
This calculation is not that heavy but it is needed only in tests, and
in case the number of tenants/timelines is high the calculation can take
noticeable time.
Resolves https://github.com/zenithdb/zenith/issues/804
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.
The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.
Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.
The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers. Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.
The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.
Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.
We've seen some failures with "Address already in use" errors in the
tests. It's not clear why, perhaps some server processes are not cleaned
up properly after test, or maybe the socket is still in TIME_WAIT state.
In any case, let's make the tests more robust by checking that the port
is free, before trying to use it.
We generate the initial tenantid and store it in the file, so it shouldn't
be missing. But let's cope with it. (This comes handy with the bigger
changes I'm working on at https://github.com/zenithdb/zenith/pull/788)
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication, and no safekeepers. Tests that
want to use safekeepers or authentication set up a custom test-specific
ZenithEnv fixture.
Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
Each test now gets its own test output directory, like
'test_output/test_foobar', even when TEST_SHARED_FIXTURES is used.
When TEST_SHARED_FIXTURES is not used, the zenith repo for each test
is created under a 'repo' subdir inside the test output dir, e.g.
'test_output/test_foobar/repo'
The -D option to specify working directory was broken:
$ mkdir foobar
$ ./target/debug/safekeeper -D foobar
Error: failed to open "foobar/safekeeper.log"
Caused by:
No such file or directory (os error 2)
This was because we both chdir'd into to specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekepeer.log"
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".
We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
* We actually need Python 3.7 because of dataclasses
* Rerun 'pipenv lock' under Python 3.7 and add 'pipenv' to dev deps
* Update docs on developing for Python 3.7
* CircleCI: use Python 3.7 via Docker image instead of Orb
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
This change causes writer halves of a TLS stream to always flush after a
portion of bytes has been written by `std::io::copy`. Furthermore, some
cosmetic and minor functional changes are made to facilitate debug.
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').
The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.
If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.
We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same brach, so we need an identifier
for each node, separate from the branch name.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighen up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
Which is mainly generational state (terms) and useful LSNs.
Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes#726.
ref #115
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
* Use logging in python tests
* Use f-strings for logs
* Don't log test output while running
* Use only pytest logging handler
* Add more info about pytest logging
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
This change brings the following improvements to our build system:
* Now BUILD_TYPE also affects rust apps.
* From now on, cargo will respect `-jN` passed via `make`. However, note
that `rustc` may spawn multiple threads depending on compile flags.
* Cargo is able to cooperate with make to better schedule parallel jobs,
which leads to better build times (-20s in release mode on my machine).
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
Previously, the WAL receiver we would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record has been processed. If it has changed, the checkpoint
relish is updated in the repository. That's somewhat expensive, the
Checkpoint::encode() function is visible in 'perf' profile. Change that
so that we set a flag whenever the Checkpoint struct is modified, so that
we dont need to compare the whole struct anymore.
Previously, typos like `BUILD_TYPE=rlease` would silently
lead to building debug binaries. The current approach is also
more future-proof, since we might add `profile`, `valgrind`
as well as other build types.
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.
Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
When a WAL record affects multiple pages, we currently duplicate the
record for each affected page. That's a bit wasteful, but not too bad
for b-tree splits and non-hot heap updates that affect two pages. But
buffering GiST index build WAL-logs the whole relation in 32 page chunks,
with one giant WAL record for each 32-page chunk. Currently we duplicate
that giant record for each of the 32 pages, which is really wasteful.
Github issue https://github.com/zenithdb/zenith/issues/720 tracks the
problem. This commit adds a test case for it to demonstrate it.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.
This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
The caller is now responsible for lookin up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.
There was a bug in that, as Konstantin noted on discord:
Assume that freeze doesn't create new inmem layer
(maybe_new_open=None). Then we temporary place in historics frozen
layer. Assume that now new put_wal_record request arrives. There is
no open in-mem layer, so it has to create new one. It is looking for
previous layer for read and set it as new in-mem layer
predecessor. But as far as I understand, prev layer should be our
temporary frozen layer. Which will be then removed from
historics.
That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.
This makes two subtle changes:
1. When the first new layer is created on a branch for a segment that
existed on the ancestor branch, the start_lsn of the new layer is now
the branch point + 1. We were previously slightly confused on what
the branch point LSN meant. It means that all the WAL up to and
*including* the LSN on the old branch is visible to the new branch.
If we mark the start LSN of the new layer as equal to the branch point,
that's wrong, because if there is a WAL record with that LSN on the
predecessor layer, the new layer would hide it. This bug was hidden
when the layer on the new branch contained a direct reference to the
layer in the old branch, as get_page_reconstruct_data() followed that
reference directly when it didn't find the page version in the new
layer. But now that the caller performs the lookup, it will look up
the new layer that doesn't contain the record, and you get an error.
2. InMemoryLayer now always stores the segment size at the beginning
of the layer's LSN range. Previously, get_seg_size() might have
recursed into the predecessor layer to get the size, but now we
avoid that by always copying over the last size from the previous
layer, when a new layer is created.
The dockerignore and dockerfile have also been excluded from being moved into
docker images, saving docker layer cache busts if only those are changed.
Commit ca9af37478 removed the import_timeline_wal() call from here.
After that, the info!() message is bogus, as we no longer load the WAL
from local disk. Also, the logical size assertion is pointless now.
All the changes are in the vendor/postgres side. However, because we now
generate fewer Full Page Writes, the 'branch_behind' test needs to be
modified so that it still generates enough WAL to consume a few WAL
segments.
The metadata file is now always 512 bytes. The last 4 bytes are a
crc32c checksum of the previous 508 bytes. Padding zeroes are added
between the serde serialization and the start of the checksum.
A single write call is used, and the file is fsyncd after.
On file creation, the parent directory is fsyncd as well.
It previously took &SafeKeeperState similar to persist(), but only for its
`server` member.
Now it takes &ServerInfo only, so there it's clear the state is not persisted.
Also added a comment about sync.
* Send ProposerGreeting manually in tests
* Move test_sync_safekeepers to test_wal_acceptor.py
* Capture test_sync_safekeepers output
* Add comment for handle_json_ctrl
* Save captured output in CI
* make .dockerignore `ncdu -X` compatible to easily inspect build context
* remove cargo-chef as it was introducing more problems than it was solving
* remove rocksdb packages
* add ca-certs in the resulting image. We need that to be able to make https
connections from container with proxy to the console.
Similar to what commit 7fb7f67b did to 'freeze', dealing with the
dropped segment separately from the rest of the logic makes the code
easier to follow. It is also needed by the next commit that replaces
the code to build new BTreeMap with an iterator; we cannot pass one
of two kinds of closures as argument, it has to always be the same one.
Having separate DeltaLayer::create() calls for the case of dropped
segment and the other cases works around that.
Positive EncryptionResponse should set 'S' byte, not 'Y'. With that
fix it is possible to connect to proxy with SSL enabled and read
deciphered notice text. But after the first query everything stucks.
rsa_private_keys() function returns an empty vector when tries to read
pkcs8-encoded file instead of returning an error. So previous check was
failing on pkcs8. Leave only pkcs8 for now.
Reduces the CPU time spent in the write() syscalls. I noticed that we were
spending a lot of CPU time in libc::write, coming from request_redo(), in
the 'bulk_insert' test. According to some quick profiling with 'perf',
this reduces the CPU time spent in request_redo() from about 30% to 15%.
For some reason, it doesn't reduce the overall runtime of the 'bulk_insert'
test much, maybe by one second if you squint (from about 37s to 36s), so
there must be some other bottleneck, like I/O. But this is surely still
a good idea, just based on the reduced CPU cycles.
Commit message copied below:
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.
While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.
So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.
Fixes https://github.com/zenithdb/zenith/issues/517.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.
For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).
This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.
Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
Postgres commit message:
PQgetCopyData can sometimes indicate that the copy is done if the
backend returns an error response. So while we still expect that the
walkeeper never sends CopyDone, we can't expect it to never produce
errors.
- Turn dropped layers into non-writeable in get_layer_for_write().
- Handle non-writeable dropped layers in checkpointer. They don't need freezing, so just remove them from list of open_segs and write out to disk.
- Remove code that handles dropped layers in freeze() function. It is not used anymore.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones
anyhow uses the alternate formatting style ("{:#}") to display all of
the causes of an error instead of the outermost context.
Without this, there's less information available to figure out what's
going on. It's probably too much to display in the compute node logs
though, so it's better to leave that formatting as-is.
In this test safekeepers are restarted one by one, while bank transactions
are executed and validated in the background. Bank transactions consist of
balance transfers and log writes. In the end balance sum should remain the
same and there should be progress from every client, when 2 of 3 safekeeper
nodes are up.
It's not interesting for most tests, and clutters the output. If there
are individual tests where it is worthwhole, let's add pg_controldata calls
to those tests, but I don't think it's needed for now.
If the 'latest' flag in the client request is true, the client wants the
latest page version regardless of the LSN in the request. The LSN is just
a hint in that case, indicating that the page hasn't been modified since
since that LSN. The LSN can be very old, so it's possible that the page
server has already garbage collected away the layer at that LSN. We tried
to fetch the old layer and errored out if that happened. To fix, always
fetch the data as of last-record-LSN, if 'latest' is set in the client
request. We now only use the LSN to wait if the requested LSN hasn't been
received and processed yet.
Fixes https://github.com/zenithdb/zenith/issues/567
- Use different message formats for different kinds of response messages.
- Add an Error message, for passing errors from page server to Postgres.
Previously, we would respond to 'exists' request with 'false', and
to 'nblocks' request with 0, if an error happened. Fix those to return
an error message to the client. GetPage requests had a mechanism to
return an error, but it was just a flag with no error message.
- Add a flag to requests, to indicate that we actually want the latest
page version on the timeline, and the LSN is just a hint that we know
that there haven't been any modifications since that LSN. The flag isn't
used for anything yet, but I'm planning to use it to fix
https://github.com/zenithdb/zenith/issues/567
Most of the previous usages of get_repository_for_tenant were followed
by immediately getting a timeline in that repository, without keeping it
around for longer.
The new `get_timeline_for_tenant` function implements that same
behavior, but in one line.
- Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance.
- Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.
Support is done via pytest-xdist plugin.
To use the feature add -n<concurrency> to pytest invocation
e.g. pytest -n8 to run 8 tests in parallel.
Changes in code are mostly about ports assigning. Previously port for
pageserver was hardcoded without the ability to override through zenith
cli and ports for started compute nodes were calculated twice, in zenith
cli and in test code. Now zenith cli supports port arguments for
pageserver and compute nodes to be passed explicitly.
Tests are modified in such a way that each worker gets a non overlapping
port range which can be configured and now contains 100 ports. These
ports are distributed to test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
Data written to frozen layers is lost. It will not appear in on-disk
structures or in successor InMemoryLayers. Here we detect this race, and
fail. I think this race is rare, but this should make it easier to track
down when it happens.
Implement the changes suggested in a comment, create
`get_layer_for_read_locked` so that `get_layer_for_write` doesn't have
to drop the LayerMap lock when searching for the predecessor.
This contains a lowest common denominator of pageserver and safekeeper log
initialisation routines. It uses daemonize flag to decide where to
stream log messages. In case daemonize is true log messages are
forwarded to file. Otherwise streaming to stdout is used. Usage of
stdout for log output is the default in docker side of things, so make
it easier to browse our logs via builtin docker commands.
This job will be responsible for triggering remote CI pipeline in
zenithdb/console repository. That way, we'll always know when
a PR to zenithdb/zenith breaks the cloud console app.
Otherwise restart of safekeeper before the first segment is filled makes it
report 0 as flushed LSN. To this end, tweak find_end_of_wal_segment to allow
starting from given LSN, not only from the start of the segment. While here,
make it less panicky.
Otherwise we produce corrupted record holes in WAL during compute node restart
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- reliably persisted part of WAL.
ref #549.
Mostly by @knizhnik. I adjusted to make sure proposer always starts streaming
since record beginning so we don't need special quirks for decoding in
safekeeper.
Ran into problems launching the WAL redo process on OS X after 4b73ad.
Launching the `initdb` process was met with "bad file descriptor" errors.
Using dtrace, I found shortly after calling `posix_spawn` for `initdb`,
`kevent` was returning this error.
I haven't dug super deep to see if the daemonization itself is the
problem, but this commit fixes it for me. My hunch is that some file
descriptors used when the Tokio runtime is initailzed become invalid
in the daemon process.
When you advance last record LSN, *all* changes up to that LSN should be
imported into repository. We have been a bit sloppy about that when it
comes to the checkpoint information that we also store in the repository.
In WAL receiver, for example, we would receive a WAL record, advance
last record LSN, and only then update the checkpoint relish at the same
LSN. Reorder that so that you advance the last record LSN only after
updating the checkpoint relish. It hasn't apparently caused any problems
so far, but let's be tidy.
Tighten the check for that in get_layer_for_write(), so that it checks for
'lsn > last_record_lsn' rather than 'lsn >= last_record_lsn'.
Always use lsn(0) as the initial last_record_lsn. It is updated soon after
creating the timeline anyway, after loading the bootstrap data, so it
doesn't stay long in that state. I was a bit worried about using a special
value like 0, but it's actually nice that you can distinguish it from any
real LSN value. The unit tests have been using Lsn(0) as the initial start
LSN all along.
Move the responsibility to wait for the WAL to arrive to the callers, and
remove the wait_lsn() calls from the Timeline::get_page_at_lsn() and
friends. We were not totally consistent before, list_rels() was missing the
wait_lsn() call for example.
Closes https://github.com/zenithdb/zenith/issues/521
The other crates in this repository use zenithdb/rust-postgres as a
dependency for the related items, instead of the crates.io versions.
Switching to using that for the proxy as well removes an additional
three dependencies when we compile. (319 -> 316)
by binding sockets before daemonization
also use less annoying error reporting by not printing full error
messages for connect errors in first several connection retries
closes#507
In order to exclude problems with synchronizing disk and memory logical
size is not stored in metadata on disk. It is calculated on timeline
"start" by scanning the contents of layered repo and then size is maintained
via an atomic variable.
This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
- Reorder the structs and functions
- Delegate many of the operations in LayerMap to SegEntry. For example,
`LayerMap::insert_open` now looks up the right SegEntry struct, and
then calls `SegEntry::insert_open` on it.
- Use HashMap::entry() function with or_default() to implement the lookups
with less code
Commit 66929ad6fb added a 'generation' number to open segments stored
in the layer map, to distinguish old layers from layers that were
added to the map during checkpoint processing. But it neglected the
OpenSegEntry::cmp() function.
It seems that the cmp() function is never used by BinaryHeap, so this
didn't cause any user-visible bugs (I tried adding a panic() to the
cmp() function and it didn't fire). But it's clearly wrong and we need
to fix it, anyway.
Compare files in existing compute node's pgdata with fresh basebackup at the same lsn. We expect that content is identical, except tmp files
Use it after some tests.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change few tests to accommodate this new behavior.
In a passing fix two minor issues with basabackup:
* check that we can't create branches with pre-initdb LSN's
* normalize branch LSN's that are pointing to the segment boundary
patch by @knizhnik
closes#506
Now that the page server collects this metric (since commit 212920e47e),
let's include it in the performance test results
The new metric looks like this:
performance/test_perf_pgbench.py . [100%]
--------------- Benchmark results ----------------
test_pgbench.init: 6.784 s
test_pgbench.pageserver_writes: 466 MB <---- THIS IS NEW
test_pgbench.5000_xacts: 8.196 s
test_pgbench.size: 163 MB
=============== 1 passed in 21.00s ===============
This should fix the sporadic regression test failures we've been seeing
lately with "no base img found" errors.
This fixes the common case, but one corner case is still not handled:
If a relation is extended across a segment boundary, leaving a gap block
in the segment preceding the segment containing the target block, the
preceding segment will not be padded with zeros correctly. This adds
a test case for that, but it's commented out.
See github issue https://github.com/zenithdb/zenith/issues/500
To fix, break out of the loop when you reach an in-memory layer that was
created after the checkpoint started. To do that, add a "generation"
counter into the layer map.
Fixes https://github.com/zenithdb/zenith/issues/494
Previously, the InMemoryLayer and DeltaLayer implementations of
get_page_reconstruct_data would recursively call the predecessor layer's
get_page_reconstruct_data function. Refactor so that we iterate in the
caller instead. Make get_page_reconstruct_data() return the predecessor
layer along with the continuation LSN, so that the caller can iterate.
IMO this makes the logic more clear, although this is more lines of code.
DeltaLayer uses the name `predecessor` for the same thing. Use the
same name in InMemoryLayer. The 'img_layer' name was misleading, as
the predecessor layer is not necessarily an image layer. Currently,
the 'freeze' function always creates a new image layer, but it
wouldn't have to be that way. Also, when you create a new branch, at
the branch point the predecessor layer can be a delta layer on the
ancestor branch.
* add lsn argument
* do not expose wait_lsn, wait inside list_nonrels()
* fix parameters parsing
* expose get_last_record_rlsn() to atomically read (last,prev) pair
More work is needed to correctly handle basebackup@old_lsn but current
approach already allows to fix test_restart_compute
There are two main reasons for that:
a) Latest unfinished record may disapper after compute node restart, so let's
try not leak volatile part of the WAL into the repository. Always use
last_valid_record instead.
That change requires different getPage@LSN logic in postgres -- we need
to ask LSN's that point to some complete record instead of GetFlushRecPtr()
that can point in the middle of the record. That was already done by @knizhnik
to deal with the same problem during the work on `postgres --sync-safekeepers`.
Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we
also need to be sure that last_valid_record is aligned.
b) Switch to get_last_record_lsn() in basebackup@no_lsn. When compute node
is running without safekeepers and streams WAL directly
to pageserver it is important to match basebackup LSN and LSN of replication
start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn
and walreceiver started replication with last_record_lsn, which can be less.
So replication was failing since compute node doesn't have requested WAL.
Make this test look like 'test_compute_restart.sh' by @ololobus, which
was surprisingly good for checking safekeepers behavior. This test adds
an intermediate compute node start with bulk select that causes a lot of
FPI's and select itself wouldn't wait for all that WAL to be replicated.
So if we kill compute node right after that we end up with lagging safekeepers
with VCL != flush_lsn. And starting new node from that state takes special
care.
Also, run and print `pg_controldata` output after each compute node start
to eyeball lsn/checkpoint info of basebackup.
This commit only adds test without fixing the problem.
I noticed that the timeline directory contained files like this:
pg_xact_0000_0_000000000169C3C2_00000000016BB399
pg_xact_0000_0_00000000016BB399
pg_xact_0000_0_00000000016BB399_00000000016BDD06
pg_xact_0000_0_00000000016BDD06
pg_xact_0000_0_00000000016BDD06_00000000016C63AA
pg_xact_0000_0_00000000016C63AA
pg_xact_0000_0_00000000016C63AA_0000000001765226_DROPPED
pg_xact_0000_0_0000000001765226
pg_xact_0001_0_00000000016BB77E_00000000016BDD06
pg_xact_0001_0_00000000016BDD06
pg_xact_0001_0_00000000016BDD06_0000000001765226_DROPPED
pg_xact_0001_0_0000000001765226
Note how there is an image file after each DROPPED file. It's a waste of
time and space to materialize an image of the file at the point where it's
dropped, no one is going to request pages on a dropped relation. And it's
a correctness issue too: list_rels() and list_nonrels() will not consider
the relation as unlinked, unless the latest layer indicates so, and there
is no concept of a dropped image layer. That was causing test_clog_truncate
test to fail, when I adjusted the checkpointer to force a checkpoint more
aggressively.
There are a bunch more issues related to dropped rels and branching,
see https://github.com/zenithdb/zenith/issues/502. Hence this doesn't
completely fix the issue I saw with test_clog_truncate either. But it's
a start.
The comment talked about the WAL redo thread, but commit 6e22a8f709
refactored that away. The problem the comment describes probably still
exists, so keep the comment, but update the wording.
Avoid slurping entire image files into memory.
For blocky segments, we write the bytes directly to a bookfile chapter.
The blocks are a fixed size, which allows for random access.
split the page versions into two chapters:
PAGE_VERSION_METAS - a rust BTreeMap from (block #, lsn) -> page & WAL
byte ranges in PAGE_VERSIONS_CHAPTER
PAGE_VERSIONS_CHAPTER - raw page images and serialized WAL records
Once upon a time, 'page_cache.rs' contained an actual page cache, but
it hasn't for a very long time. Rename to reflect what it actually does
these days.
This provides a pytest fixture to record metrics from pytest tests. The
The recorded metrics are printed out at the end of the tests.
As a starter, this includes on small test, using pgbench. It prints out
three metrics: the initialization time, runtime of 5000 xacts, and the
repository size after the tests.
epochStartLsn is the LSN since which new proposer writes its WAL in its epoch,
let's be more explicit here.
truncate_lsn is LSN still needed by the most lagging safekeeper. restart_lsn is
terminology from pg_replicaton_slots, but here we don't really have 'restart';
hopefully truncate word makes it clearer.
1) Extract consensus logic to safekeeper.rs.
2) Change the voting flow so that acceptor tells his epoch along with giving
the vote, not before it; otherwise it might get immediately stale. #294
3) Process messages from compute atomically and sync state properly. #270
4) Use separate structs for disk and network.
ref #315
Previously, a SnapshotLayer and corresponding file on disk contained the
base image of every page in the segment at the start LSN, and all the
changes (= WAL records) in the range between start and end LSN. That was
a bit awkward, because we had to keep the base image of every page in
memory until we had accumulated enough WAL after the base image to write
out the layer. When it's time to write out a layer, we would really want
to replay the WAL to reconstruct the most recent version of each page, to
save the effort later. That's on the assumption that the client will
usually request the most recent version, not some older one.
Split the SnapshotLayer into two structs: ImageLayer and DeltaLayer. An
image layer contains a "snapshot" of the segment at one specific LSN, and
no WAL records, whereas a delta layer contains WAL records in a range of
LSNs. In order to reconstruct a page version in the delta layer, by
performing WAL redo, you also need the previous image layer. So the delta
layers are "incremental" against the previous layer.
So where previously we would create snapshot files like this:
rel_100_200
rel_200_300
rel_300_400
We now create image and delta files like this:
rel_100 # image
rel_100_200 # delta
rel_200
rel_200_300
rel_300
rel_300_400
rel_400
That's more files, but as discussed above, this allows storing more
up-to-date page versions on disk, which should reduce the latency of
responding to a GetPage request. It also allows more fine-grained garbage
collection. In the above example, after the old page version are no longer
needed and if the relation is not modified anymore, we only need to keep
the latest image file, 'rel_400', and everything else can be removed.
Implements https://github.com/zenithdb/zenith/issues/339
Now that we only have one Repository implementation, no need for the
command-line options to choose it either. I'm removing these as a separate
commit to show what we will need to do if we add another Repository
implementation in the future (even though I don't foresee us doing that
any time soon)
The layered storage format is good enough that we don't need the rocksdb
implementation anymore. There are a lot of known issues but we'll keep
working on them.
Now that the new storage format is based on immutable files, we want to
implement push/pull in terms of these immutable files as well. Similarly
to how those files will be transferred between S3 and the page server.
The implementation we had was fairly tightly coupled with the object
repository implementation, but I'm about to remove the object / rocksdb
storage format soon. That would leave the current "zenith push" command
completely broken.
It seemed like a good idea at the time, but in hindsight, it was premature
to implement push/pull yet. It's a nice feature and I'd like to see it
reimplemented in the future, but in the meanwhile, let's remove the code
we had. We can dig the parts of it that might be useful in the future
from the git history.
The old policy was to flush all in-memory layers to disk every 10 seconds.
That was a pretty dumb policy, unnecessarily aggressive. This commit
changes the policy so that we only flush layers where the oldest WAL
record is older than 16 MB from the last valid LSN on the timeline. That's
still pretty aggressive, but it's a step in the right direction. We do
need a limit on how old the oldest in-memory layer is allowed to be,
because that determines how much WAL the safekeepers need to hold onto,
and how much WAL we need to reprocess in case of a page server crash.
16 MB is surely still too aggressive for that, but it's easy to change
the setting later.
To support that, keep all in-memory layers in a binary heap, so that we
can easily find the one with the oldest LSN.
This tracks and a new LSN value in the metadata file: 'disk_consistent_lsn'.
Before, on page server restart we restarted the WAL processing from the
'last_record_lsn' value, but now that we don't flush everything to disk in
one go, the 'last_record_lsn' tracked in memory is usually ahead of the
last record that's been flushed to disk. Even though we track that oldest
LSN now, the crash recovery story isn't really complete. We don't do
fsync()s anywhere, and thing will break if a snapshot file isn't complete,
as there's no CRC on them. That's not new, and it's a TODO.
Upgrade to bindgen 0.59, which has two new abilities:
- specify arbitrary #[derive] attributes to attach to generated structs
- request explicit padding fields
These two features are enough to replace transmute with serde/bincode.
Because the t_cid field was missing from the XlHeapDelete struct that
corresponds to the PostgreSQL xl_heap_delete struct, the check for the
XLH_DELETE_ALL_VISIBLE_CLEARED flag did not work correctly.
Decoding XlHeapUpdate struct was also missing the t_cid field, but that
didn't cause any immediate problems because in that struct, the t_cid
field is after all the fields that the page server cares about. But fix
that too, as it was an accident waiting to happen.
The bug was mostly hidden by the VM page handling in zenith_wallog_page,
where it forcibly generates a FPW record whenever a VM page is evicted:
else if (forknum == VISIBILITYMAP_FORKNUM && !RecoveryInProgress())
{
/*
* Always WAL-log vm.
* We should never miss clearing visibility map bits.
*
* TODO Is it too bad for performance?
* Hopefully we do not evict actively used vm too often.
*/
XLogRecPtr recptr;
recptr = log_newpage_copy(&reln->smgr_rnode.node, forknum, blocknum, buffer, false);
XLogFlush(recptr);
lsn = recptr;
But that was just hiding the issue: it's still visible if you had a
read-only node relying on the data in the page server, or you killed and
restarted the primary node, or you started a branch. In the included test
case, I used a new branch to expose this.
Fixes https://github.com/zenithdb/zenith/issues/461
- Move source tree overview into separate docs/sourcetree.md and update it.
- Add glossary: docs/glossary.md
- Add a draft of Architecture overview to main Readme.md
There can be only one "open" layer for each segment. That's the last one,
implemented by InMemoryLayer. That's the only one where new records can
be appended to. Much of the code needed to distinguish between the last
open layer and other layers anyway, so make the distinction explicit
in LayerMap.
There was a a lot of duplicated code between the get_page_at_lsn()
implementations in InMemoryLayer and SnapshotLayer. Move the code for
requesting WAL redo from the Layer trait into LayeredTimeline. The
get-function in Layer now just returns the WAL records and base image
to the caller, and the caller is responsible for performing the WAL
redo on them.
Split each relish into fixed-sized 10 MB segments. Separate layers are
created for each segment. This reduces the write amplification if you
have a large relation and update only parts of it; the downside is
that you have a lot more files. The 10 MB is just a guess, we should
do some modeling and testing in the future to figure out the optimal
size.
Each segment tracks the size of the segment separately. To figure out
the total size of a relish, you need to loop through the segment to
find the highest segment that's in use. That's a bit inefficient, but
will do for now. We might want to add a cache or something later.
Change CLI so that we always create node from scratch at 'pg start'.
This operation preserve previously existing config
Add new flag '--config-only' to 'pg create'.
If this flag is passed, don't perform basebackup, just fill initial postgresql.conf for the node.
Track the time spent on replaying WAL records by the special Postgres
process, the time spent waiting for acces to the Postgres process (since
there is only one per tenant), and the number of records replayed.
This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.
This make the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>"
The unit tests have been refactored to exercise both implementations.
'layered' is now the default.
Push/pull is not implemented. The 'test_history_inmemory' test has been
commented out accordingly. It's not clear how we will implement that
functionality; probably by copying the snapshot files directly.
Most of the work here was done on the postgres side. There's more
information in the commit message there.
(see: 04cfa326a5)
On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to
initialize the WAL keeper protocol. Everything else is mostly the same,
with the only real difference being that protocol messages are now
discrete CopyData messages sent over the postgres protocol.
For the sake of documentation, the full set of these messages is:
<- recv: START_WAL_PUSH query
<- recv: server info from postgres (type `ServerInfo`)
-> send: walkeeper info (type `SafeKeeperInfo`)
<- recv: vote info (type `RequestVote`)
if node id mismatch:
-> send: self node id (type `NodeId`); exit
-> send: confirm vote (with node id) (type `NodeId`)
loop:
<- recv: info and maybe WAL block (type `SafeKeeperRequest` + bytes)
(break loop if done)
-> send: confirm receipt (type `SafeKeeperResponse`)
My main motivation is to make it easier to attribute time spent in WAL
redo to the request that needed the WAL redo. With this patch, the WAL
redo is performed by the requester thread, so it shows up in stack traces
and in 'perf' report as part of the requester's call stack. This is also
slightly simpler (less lines of code) and should be a bit faster too.
The upcoming layered storage implementation handles GC as a
repository-wide operation because it needs to pay attention to the branch
points of all timelines.
On my laptop, the server was receiving the token as a string with extra
b'...' escaping, e.g as "b'eyJ0....0ifQA'" instead of just "eyJ0....0ifQA".
That was causing the test to fail.
I'm using Python 3.9, while the CI is using Python 3.8. I suspect that's
why. My version of pyjwt might be different too.
See also https://github.com/jpadilla/pyjwt/issues/391.
They represent files and use RelationSizeEntry to track existing and dropped files.
They can be both blocky and non-blocky.
get_relish_size() and get_rel_exists() functions work with physical relishes, not only with blocky ones.
Follow PostgreSQL logic: remove Twophase files when prepared transaction is committed/aborted.
Always store Twophase segments as materialized page images (no wal records).
Current state with authentication.
Page server validates JWT token passed as a password during connection
phase and later when performing an action such as create branch tenant
parameter of an operation is validated to match one submitted in token.
To allow access from console there is dedicated scope: PageServerApi,
this scope allows access to all tenants. See code for access validation in:
PageServerHandler::check_permission.
Because we are in progress of refactoring of communication layer
involving wal proposer protocol, and safekeeper<->pageserver. Safekeeper
now doesn’t check token passed from compute, and uses “hardcoded” token
passed via environment variable to communicate with pageserver.
Compute postgres now takes token from environment variable and passes it
as a password field in pageserver connection. It is not passed through
settings because then user will be able to retrieve it using pg_settings
or SHOW ..
I’ve added basic test in test_auth.py. Probably after we add
authentication to remaining network paths we should enable it by default
and switch all existing tests to use it.
Server functionality requires not only the "server" feature flag, but
also either "http1" or "http2" (or both). To make things simpler
(and prevent analogous problems), enable all features.
Current GC test is flaky and overly strict. Since we are migrating to the layered repo format
with different GC implementation let's just silence this test for now.
This patch has been extracted from #348, where it became unnecessary
after we had decided that we didn't want to measure anything inside
PostgresBackend.
IMO the change is good enough to make its way into the codebase,
even though it brings nothing "new" to the code.
The metrics are served by an http endpoint, which
is meant to be spawned in a new thread.
In the future the endpoint will provide more APIs,
but for the time being, we won't bother with proper routing.
This clarifies - I hope - the abstractions between Repository and
ObjectRepository. The ObjectTag struct was a mix of objects that could
be accessed directly through the public Timeline interface, and also
objects that were created and used internally by the ObjectRepository
implementation and not supposed to be accessed directly by the
callers. With the RelishTag separaate from ObjectTag, the distinction
is more clear: RelishTag is used in the public interface, and
ObjectTag is used internally between object_repository.rs and
object_store.rs, and it contains the internal metadata object types.
One awkward thing with the ObjectTag struct was that the Repository
implementation had to distinguish between ObjectTags for relations,
and track the size of the relation, while others were used to store
"blobs". With the RelishTags, some relishes are considered
"non-blocky", and the Repository implementation is expected to track
their sizes, while others are stored as blobs. I'm not 100% happy with
how RelishTag captures that either: it just knows that some relish
kinds are blocky and some non-blocky, and there's an is_block()
function to check that. But this does enable size-tracking for SLRUs,
allowing us to treat them more like relations.
This changes the way SLRUs are stored in the repository. Each SLRU
segment, e.g. "pg_clog/0000", "pg_clog/0001", are now handled as a
separate relish. This removes the need for the SLRU-specific
put_slru_truncate() function in the Timeline trait. SLRU truncation is
now handled by caling put_unlink() on the segment. This is more in
line with how PostgreSQL stores SLRUs and handles their trunction.
The SLRUs are "blocky", so they are accessed one 8k page at a time,
and repository tracks their size. I considered an alternative design
where we would treat each SLRU segment as non-blocky, and just store
the whole file as one blob. Each SLRU segment is up to 256 kB in size,
which isn't that large, so that might've worked fine, too. One reason
I didn't do that is that it seems better to have the WAL redo
routines be as close as possible to the PostgreSQL routines. It
doesn't matter much in the repository, though; we have to track the
size for relations anyway, so there's not much difference in whether
we also do it for SLRUs.
While working on this, I noticed that the CLOG and MultiXact redo code
did not handle wraparound correctly. We need to fix that, but for now,
I just commented them out with a FIXME comment.
The codepath for tenant_create command first launched the WAL redo
thread, and then called branches::create_repo() which checked if the
tenant's directory already exists. That's problematic, because
launching the WAL redo thread will run initdb if the directory doesn't
already exist. Race condition: If the tenant already exists, it will
have a WAL redo thread already running, and the old and new WAL redo
thread might try to run initdb at the same time, causing all kinds of
weird failures.
The test_pageserver_api test was failing 100% repeatably on my laptop
because of this. I'm not sure why this doesn't occur on the CI:
Jul 31 18:05:48.877 INFO running initdb in "./tenants/5227e4eb90894775ac6b8a8c76f24b2e/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:483
thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "heikki".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory ./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Europe/Helsinki
creating configuration files ... ok
running bootstrap script ...
stderr:
2021-07-31 15:05:48.875 GMT [282569] LOG: could not open configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf": No such file or directory
2021-07-31 15:05:48.875 GMT [282569] FATAL: configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf" contains errors
child process exited with exit code 1
initdb: removing data directory "./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir"
- Add new subdir postgres_ffi/samples/ for config file samples.
- Don't copy wal to the new branch on zenith init or zenith branch.
- Import_timeline_wal on zenith init.
It was pretty cool, but no one used it, and it had gotten badly out of
date. The main interesting thing with it was to see some basic metrics
on the fly, while the page server is running, but the metrics collection
had been broken for a long time, too. Best to just remove it.
this patch adds support for tenants. This touches mostly pageserver.
Directory layout on disk is changed to contain new layer of indirection.
Now path to particular repository has the following structure: <pageserver workdir>/tenants/<tenant
id>. Tenant id has the same format as timeline id. Tenant id is included in
pageserver commands when needed. Also new commands are available in
pageserver: tenant_list, tenant_create. This is also reflected CLI.
During init default tenant is created and it's id is saved in CLI config,
so following commands can use it without extra options. Tenant id is also included in
compute postgres configuration, so it can be passed via ServerInfo to
safekeeper and in connection string to pageserver.
For more info see docs/multitenancy.md.
It used to be the case that walkeeper's background thread
failed to recognize the end of stream (EOF) signaled by the
`Ok(None)` result of `FeMessage::read`.
* Introducing common enum ObjectVal for all values
* Rewrite push mechanism to use raw object copy
* Fix history unit test
* Add skip_nonrel_objects functions for history unit tests
It removes remaining issues with running cargo audit. There was one
error and one warning:
Crate: tokio
Version: 1.5.0
Title: Task dropped in wrong thread when aborting `LocalSet` task
Date: 2021-07-07
ID: RUSTSEC-2021-0072
URL: https://rustsec.org/advisories/RUSTSEC-2021-0072
Solution: Upgrade to >=1.5.1, <1.6.0 OR >=1.6.3, <1.7.0 OR >=1.7.2, <1.8.0 OR >=1.8.1
Crate: cpuid-bool
Version: 0.1.2
Warning: unmaintained
Title: `cpuid-bool` has been renamed to `cpufeatures`
Date: 2021-05-06
ID: RUSTSEC-2021-0064
URL: https://rustsec.org/advisories/RUSTSEC-2021-0064
When executed, pipenv shell creates a fresh Pipfile if none
is found in the current directory. This is confusing,
hence the patch to symlink it at the top level, which
is a good starting point for various commands.
Based on Konstantin's original patch (PR #275), but I introduced helper
functions for serializing/deserializing the different kinds of
ObjectValues, which made it more pleasant to use, as the deserialization
checks are now performed in the helper functions.
Add back code to parse transaction commit and abort records, and in
particular the list of dropped relations in them. Add 'put_unlink'
function to the Timeline trait and implementation. We had the code to
handle dropped relations in the GC code and elsewhere in ObjectRepository
already, but there was nothing to create the RelationSizeEntry::Unlink
tombstone entries until now. Also add a test to check that GC correctly
removes all page versions of a dropped relation.
Implements https://github.com/zenithdb/zenith/issues/232, except for the
"orphaned" rels.
Reviewed-by: Konstantin Knizhnik
Some of these were related to handling various WAL records that are not
related to any relations, like pg_multixact updates. These should have
been removed in the revert commit 6a9c036ac1, but I missed them.
Also, we didn't anything with commit/abort records. We will start
parsing commit/abort records in the next commit, but seems better to
add that from clean slate.
Reviewed-by: Konstantin Knizhnik
Without this step, the page versions won't actually be removed, they're
just marked for deletion on the next RocksDB "merge" or "compact"
operation.
Author: Konstantin Knizhnik
- Print the number of dropped relations, and the number of relations
encountered overall.
- If a block has only one page version, the latest one, don't count it as
a "truncated" version history. Only count pages for which we actually
removed some old versions.
- Change "last" to "latest" in variable names and comments. "Last" could
be interpreted as "oldest", but here it means "newest".
- Add a comment noting that the GC code depends on get_page_at_lsn_nowait
to store the materialized page version in the repository.
- Change "last" to "latest" in variable names for clarity. "Last" could
be interpreted as the oldest, but here it means newest.
This patch aims to:
* Unify connection & querying logic of ZenithPagerserver and Postgres.
* Mitigate changes to transaction machinery introduced in `psycopg2 >= 2.9`.
Now it's possible to acquire db connection using the corresponding
method:
```python
pg = postgres.create_start('main')
conn = pg.connect()
...
conn.close()
```
This pattern can be further improved with the help of `closing`:
```python
from contextlib import closing
pg = postgres.create_start('main')
with closing(pg.connect()) as conn:
...
```
All connections produced by this method will have autocommit
enabled by default.
The clippy maintainers have not provided an easy way for projects to
configure the set of lints they would like enabled/disabled. It's
particularly bad for projects using workspaces, which can easily lead to
duplicated clippy annotations for every crate, library, binary, etc.
Add a shell script that runs clippy, with a few unhelpful lints
disabled:
new_without_default
manual_range_contains
comparison_chain
If you save this in your path under the name "cargo-zclippy" (or
whatever name you like), then you can run it as "cargo zclippy" from the
shell prompt. If your text editor has rust-analyzer integration, you can
also use this new command as a replacement for "cargo check" or "cargo
clippy" and see clippy warnings and errors right in the editor.
To simplify cloud ops, allow configuration via file.
toml is used as the config format, and the file is stored in the working
directory.
Arguments used at initialization are saved in the config file.
Config file params may be overridden by CLI arguments.
Use CLI args instead of environment variables to parameterize the
working directory and postgres distirbution.
Before this change, there was a mixture of environment variables and CLI
arguments that needed to be set. Moving to a single input simplifies
cloud configuration management.
The bool::then function was added in Rust 1.50. I'm still using 1.48 on
my laptop. We haven't decided what Rust version we will require
(https://github.com/zenithdb/zenith/issues/138), and I'll probably need
to upgrade sooner or later, but this will do for now.
It's not realistic to enable full-blown type checks
within test_runner's codebase, since the amount of
warnings revealed by mypy is overwhelming.
Tests are supposed to be easy to use, so we can't
cripple everybody's workflow for the sake of imaginary benefit.
Ultimately, the purpose of this attempt is three-fold:
* Facilitate code navigation when paired with python-language-server.
* Make method signatures apparent to a fellow programmer.
* Occasionally catch some obvious type errors.
Clear a clippy warning about manual flatten.
This isn't good error handling, but panicking is probably better than
spinning forever if stdin returns EOF.
Previously, transaction commit could happen regardless of whether
pageserver has caught up or not. This patch aims to fix that.
There are two notable changes:
1. ComputeControlPlane::new_node() now sets the
`synchronous_standby_names = 'pageserver'` parameter to delay
transaction commit until pageserver acting as a standby has
fetched and ack'd a relevant portion of WAL.
2. pageserver now has to:
- Specify the `application_name = pageserver` which matches the
one in `synchronous_standby_names`.
- Properly reply with the ack'd LSNs.
This means that some tests don't need sleeps anymore.
TODO: We should probably make this behavior configurable.
Fixes#187.
Now postgres_backend communicates with the client, passing queries to the
provided handler; we have two currently, for wal_acceptor and pageserver.
Now BytesMut is again used for writing data to avoid manual message length
calculation.
ref #118
Note the unsafety of the unsafe block, with a link to the ongoing
discussion. This doesn't try to solve the problem, but let's at least
document the status quo.
That is mandatory to correctly maintain visibility map (see issue#192).
It also makes sense to check that wal_log_hints is enabled at the pageserver side,
but for now let just check that tests will pass with this on.
- All timelines are now stored in the same rocksdb repository. The GET
functions have been taught to follow the ancestors.
- Change the way relation size is stored. Instead of inserting "tombstone"
entries for blocks that are truncated away, store relation size as
separate key-value entry for each relation
- Add an abstraction for the key-value store: ObjectStore. It allows
swapping RocksDB with some other key-value store easily. Perhaps we
will write our own storage implementation using that interface, or
perhaps we'll need a different abstraction, but this is a small
improvement over status quo in any case.
- Garbage Collection is broken and commented out. It's not clear where and
how it should be implemented.
There was no guarantee that the SELECT FOR KEY SHARE queries actually
run in parallel. With unlucky timing, one query might finish before
the next one starts, so that the server doesn't need to create a
multixact. I got a failure like that on the CI:
batch_others/test_multixact.py:56: in test_multixact
assert(int(next_multixact_id) > int(next_multixact_id_old))
E AssertionError: assert 1 > 1
E + where 1 = int('1')
E + and 1 = int('1')
This could be reproduced by adding a random sleep in the runQuery
function, to make each query run at different times.
To fix, keep the transactions open after running the queries, so that
they will surely be open concurrently. With that, we can run the
queries serially, and don't need the 'multiprocessing' module anymore.
Fixes https://github.com/zenithdb/zenith/issues/196
Move `save_decoded_record` out of the Timeline trait. The storage
implementation shouldn't need to know how to decode records.
Also move put_create_database() out of the Timeline trait. Add a new
`list_rels` function to Timeline to support it, instead.
Rename `get_relsize` to `get_rel_size`, and `get_relsize_exists` to
`get_rel_exists`. Seems nicer.
Add build dependencies and other local packages needed (Ubuntu only).
Fix some weird formatting of psql commands due to `sh` syntax
highlighting.
Improve test directions, so pytest doesn't scan the whole tree.
Drop description of the integration_tests directory since it's on its
way out.
Derive Serialize+Deserialize for RelTag, BufferTag, CacheKey. Replace
handwritten pack/unpack functions with ser, des from
zenith_utils::bin_ser (which uses the bincode crate).
There are some ugly hybrids in walredo.rs, but those functions are
already doing a lot of questionable manual byte-twiddling, so hopefully
the weirdness will go away when we get better postgres protocol
wrappers.
- Previously, we checked on first use of a timeline, whether there is
a snapshot and WAL for the timeline, and loaded it all into the
(rocksdb) repository. That's a waste of effort if we had done that
earlier already, and stopped and restarted the server. Track the
last LSN that we have loaded into the repository, and only load the
recent missing WAL after that.
- When you create a new zenith repository with "zenith init",
immediately load the initial empty postgres cluster into the rocksdb
repository. Previously, we only did that on the first connection. This
way, we don't need any "load from filesystem" codepath during normal
operation, we can assume that the repository for a timeline is always
up to date. (We might still want to use the functionality to import an
existing PostgreSQL data directory into the repository in the future,
as a separate Import feature, but not today.)
This includes the following commits:
35a1c3d521 Specify right LSN in test_createdb.py
d95e1da742 Fix issue with propagation of CREATE DATABASE to the branch
8465738aa5 [refer #167] Fix handling of pg_filenode.map files in page server
86056abd0e Fix merge conflict: set initial WAL position to second segment because of pg_resetwal
2bf2dd1d88 Add nonrelfile_utils.rs file
20b6279beb Fix restoring non-relational data during compute node startup
06f96f9600 Do not transfer WAL to computation nodes: use pg_resetwal for node startup
As well as some older changes related to storing CLOG and MultiXact data as
"pseudorelation" in the page server.
With this revert, we go back to the situtation that when you create a
new compute node, we ship *all* the WAL from the beginning of time to
the compute node. Obviously we need a better solution, like the code
that this reverts. But per discussion with Konstantin and Stas, this
stuff was still half-baked, and it's better for it to live in a branch
for now, until it's more complete and has gone through some review.
For a CI build, storing incremental build data just makes the cache
bigger, for minimal gain.
Also, for Rust < 1.52.1 there are incremental compilation bugs. CircleCI
is currently building on 1.51.
This only affects the debug build; incremental compilation isn't used on
the release build.
Add parameters to specify which kind of build; run a debug and release
variant for each job.
Eventually this will be too many jobs, but for now this is a nice start.
Also, bump the cache string to "v02" so we don't mix up our cache output
with other branches.
Parse all the command line options before calling "zenith init" and
changing current working dir. The rest of the options don't make any
difference if we're initializing a new repository, but it seems strange
and error-prone to parse some arguments at different times.
* Add ancestor_id to pg_list->branch_list output of pageserver.
* Display branching point (LSN) for each non-root branch.
* Add tests for `zenith branch`.
It's created once early in server startup, after parsing the
command-line options, and never modified afterwards. To simplify
things, pass it around as static ref, instead of making copies in all
the different structs. We still pass around a reference to it, rather
than putting it in a global variable, to allow unit testing with
different configs in the same process.
Move ReceiveWalConn into its own file. Shuffle constants around so they
are close to the protocol they're associated with, or move them into
postgres_ffi if they seem to be global constants.
It may be more robust to use the TcpStream::peek function, so do all
protocol peeking before creating the protocol object. This reveals the
next cleanup step: rename Connection, since it's no longer the parent of
SendWalConn. Now we peek at the first bytes and choose which kind of
connection object to create.
We're starting to deserialize directly from the TcpStream now, which
means that a socket error gets logged as "deserialize error". That's not
very helpful; preserve the io::Error so it can be logged.
Serialize objects directly to the stream. This allows us to remove a
bunch of buffer management code, along with the NewSerializer trait that
was a temporary bridge between the old code and the new.
The pieces are:
base Connection
SendWal
ReplicationHandler
There are lots of other changes here:
- Put the replication reader in a background thread; this gets rid
of some hacks with nonblocking mode.
- Stop manually buffering input data; use BufReader instead.
- Use BytesMut a lot less; use Read/Write traits where possible.
If we try to read a few bytes at a time, we will perform a lot more
syscalls than necessary. Wrap the socket in a BufReader, which will
buffer bytes as needed.
libpq tolerates and ignores them, but the Rust postgres client gets
confused by them in certain states. This explained the strange failure
I saw with the Copy Out protocol. I'm not sure what the condition was
exactly, but somehow the rust client got confused if it received a
ReadyForQuery message that it was not expecting.
Fixes https://github.com/zenithdb/zenith/issues/148.
Compiling a Regex is very expensive, so let's not do it on every
invocation. This was consuming a big fraction of the time in creating
a new base backup at "zenith pg create". This commits brings down the
time to run "zenith pg create" on a freshly created repository from
about 2 seconds to 1 second.
It's not worth spending much effort on optimizing things at this stage
in general, but might as well pick low-hanging fruit like this.
I'm going nuts with the pattern:
let k = iter.key().unwrap();
buf.clear();
buf.extend_from_slice(&k);
let key = CacheKey::unpack(&mut buf);
Introduce helper functions to convert a CacheKey into BytesMut, and
from [u8] into CacheKey. Reduces the boilerplate code a lot.
The helper functions create a new BytesMut on each call, whereas the old
coding could reuse a single BytesMut, so this could be a bit slower. I
haven't tried measuring it, but at least it's not immediately noticeable,
and readability is much more imporatant at this point. We can optimize
later
This isn't very exciting with the current RocksDB implementation, because
it doesn't care about the PostgreSQL 1 GB segment boundaries at all.
But I think we will care about this in the future, and more tests is
generally better anyway.
Commit 746f667311 added the 'workdir' field and the get_*_path()
functions, with the idea that we cd into the directory at page server
startup, so that the get_*_path() functions can always return paths
relative to '.', but 'workdir' shows the original path to it. Change it
so that 'conf.workdir' is always set to '.', too, and the get_*_path()
functions include 'workdir' in the returned paths. Why? Because that
allows writing unit tests without changing the current directory.
When I was working on commit 97992226d3, I initially wrote the test so
that it changed the current working directory, just like commit 746f667311
did. But that was problematic, when I tried to add another unit test that
*also* wants to change the current working dir, because they could then
not run concurrently. In fact, they could not even run serially, unless
the current directory was carefully reset after the test. So it is better
to avoid changing the current directory in tests.
Commit 746f667311 moved the "chdir" earlier in the startup sequence,
before daemonizing. But it forgot to remove a corresponding chdir call
later in the sequence when not in daemonize mode. As a result, if you
tried to start the pageserver without the --daemonize option, it always
failed with "No such file or directory" error.
We have coverage for these things in the python tests, we don't need both.
test_redo_cases() was a pretty simple case that created a couple of
table and inserted to them. We don't have another test exactly like
that, but there is enough similar stuff in the test_branch_behind and
test_pgbench tests to cover it.
test_regress() and pgbench() are redundant with the test_pg_regress and
test_pgbench python tests.
test_pageserver_two_timelines() is similar enough to the test_branch_behind
test that we don't need it. And many other tests create branches, too.
In 746f667 I "optimized" wal_acceptor tests by setting "--pageserver"
flag only on one of wal_acceptors. Which obviously will hang the system if
that wal_acceptors is down. And test_acceptors_restarts does exctly this.
Set "--pageserver" on all wal_acceptors as it was before.
- The 'pageserver' fixture now sets up the repository and starts up
the Page Server automatically. In other words, the 'pageserver'
fixture provides a Page Server that's up and running and ready to
use in tests.
- The 'pageserver' fixture now also creates a branch called 'empty',
right after initializing the repository. By convention, all the
tests start by createing a new branch off 'empty' for the test. This
allows running all the tests against the same Page Server
concurrently. (I haven't tested that though. pytest doensn't
provide an option to run tests in parallel but there are extensions
for that.)
- Remove the 'zen_simple' fixture. Now that 'pageserver' provides
server that's up and running, it's pretty simple to use the
'pageserver' and 'postgres' fixtures directly.
- Don't assume host name or ports in the tests. They now use the
fields in the fixtures for that. That allows assigning the ports
dynamically, making it possible to run multiple page servers in
parallel, or running the tests in parallel with another page
server. This commit still hard codes the Page Server's port in the
fixture, though, so more work is needed to actually make it
possible.
- I made some changes to the 'postgres' fixture in commit 532918e13d,
which broke the other tests. Fix them.
- Divide the tests into two "batches" of roughly equal runtime, which
can be run in parallel
- Merge the 'test_file' and 'test_filter' options in CircleCI config
into one 'test_selection' option, for simplicity.
This patch started as an effort to support CLI working against remote
pageserver, but turned into a pretty big refactoring.
* CLI now does not look into repository files directly. New commands
'branch_create' and 'identify_system' were introduced into page_service to
support that.
* Branch management that was scattered between local_env and
zenith/main.rs is moved into pageserver/branches.rs. That code could better fit
in Repository/Timeline impl, but I'll leave that for a different patch.
* All tests-related code from local_env went into integration_tests/src/lib.rs as an
extension to PostgresNode trait.
* Paths-generating functions were concentrated around corresponding config
types (LocalEnv and PageserverConf).
When creating a new branch, we copied all WAL from the source timeline
to the new one, and it was being picked up and digested into the
repository on first use of the timeline. Fix by copying the WAL only
up to the branch's starting point.
We should probably move the branch-creation code from the CLI to page
server itself - that's what I was starting to hack on when I noticed this
bug - but let's fix this first.
Add a regression test. To test multiple branches, enhance the python
test fixture to manage multiple running Postgres instances. Also, for
convenience, add a function to the postgres fixture to open a connection
to the server with psycopg2.
This isn't just cosmetic, this also fixes one bug: the code in
parse_point_in_time() function used str::parse::<u64>() to parse the
parts of the LSN string (e.g. 0/1A2B3C4D). That's wrong, because the
LSN consists of hex digits, not base-10.
Commit 9ece1e863d used `slice.fill`, which isn't available until Rust
v1.50.0. I have 1.48.0 installed, so it was failing to compile for me.
We haven't really standardized on any particular Rust version, and if
there's a good feature we need in a recent version, let's bump up the
minimum requirement. But this is simple enough to work around.
Turn WalRedoManager into an abstract trait, so that it can be easily
mocked in unit tests.
One change here is that the WAL redo manager is no longer tied to a
specific zenith timeline. It didn't do anything with that information
aside from using it in the dummy datadir's name. We could use any
random string for that purpose, it's just to prevent two WAL redo
managers from stepping over each other. But this commit actually
changes things so that all timelines use the same WAL redo manager, so
that's not necessary. We will probably want to maintain a pool of WAL
redo processes in the future, but for now let's keep it simple.
In the passing, fix some comments.
We used to create them under .zenith/.zenith/<timelineid>. The double
.zenith was clearly not intentional. Change it to
.zenith/timelines/<timelineid>.
Fixes https://github.com/zenithdb/zenith/issues/127
It's a bit annoying that the .zenith state can show up in multiple
places, but since this is how the regression tests run if you launch
them from the git root directory, ignore this one too.
The module comment should use "//!" instead of "///". Otherwise, it is
considered to apply to the *next* thing, in this case the "use" statement
that follows, not the file as whole. "cargo fmt" revealed this by insisting
to move the "use crate::pg_constants" line to before the comment.
Multiple fds writing to the same file doesn't work. One fd will
overwrite the output of the other fd. We were opening log files three
times (stdout, stderr, and slog).
The symptoms can be seen when the program panics; the final file will
have truncated or lost messages. After this change, all messages are
preserved. If panicking and logging are concurrent (and they definitely
can be), some of the messages may be interleaved in slightly
inconvenient ways.
File::try_clone() is essentially `dup` underneath, meaning the two will
share the same file offset.
If timeline doesn't have a valid "last valid LSN", refuse WAL streaming.
The previous behavior was to start streaming from the very beginning of
time. That was needed to support bootstrapping the page server with no
data at all (see commit bd606ab37a), but we no longer do that.
Somehow I never learned this part correctly: relative imports use the
syntax "import .file" for a file sitting in the same directory.
This error wasn't terribly obvious, but the Pylance linter is yelling at
me so I'll fix it now before anyone else notices.
I didn't think this mattered, but it does: if you add a dependency to
zenith_utils, but forget to request a feature you need, the crate will
build from the workspace root, but not by itself.
It's probably better to pull in the whole dependency tree.
This leaves one problem unsolved: the missing feature above will now be
a latent bug. If that feature gets removed later by other crates, and
then the workspace_hack Cargo.toml is updated, this missing feature will
become a build failure.
Add fixes suggested in code review.
In a previous commit, I changed the NodeId field order and types to try
to preserve the exact serialization that was happening. Unfortunately,
that serialization was incorrect and the original struct was mostly
correct.
Change uuid to be a [u8; 16] as it was intended to be a byte array; that
will clearly indicate to serde serializers that no endian swaps will
ever be needed.
Replace XLogRecPtr with Lsn in wal_service.rs .
This removes the last use of XLogSegmentOffset and XLByteToSeg, so
delete them. (replaced by Lsn::segment_offset and Lsn::segment_number.)
The C memory representation is only needed if we want to guarantee the
same memory layout as some other program. Since we're using serde to
serialize these data structures, we can let the compiler do what it
wants.
This version validates on every call that our result is exactly the same
as the previous result.
NodeId is a strange corner case: one field is serialized little-endian
and one field is serialized big-endian. Hopefully we can fix that in the
future.
This module adds two traits that implement bincode-based serialization.
BeSer implements methods for big-endian encoding/decoding.
LeSer implements methods for little-endian encoding/decoding.
Right now, the BeSer and LeSer methods have the same names, meaning you
can't `use` them both at the same time. This is intended to be a safety
mechanism: mixing big-endian and little-endian encoding in the same file
is error-prone. There are ways around this, but the easiest fix is to
put the big-endian code and little-endian code in different files or
submodules.
This is a big async -> sync conversion. Most of it is a pretty
straightforward conversion of removing `async` and `.await` and swapping
in the right std modules.
I didn't find a thread-blocking version of `Notify` so I wrote one, and
then realized that there was already a Mutex being used there, so I
deleted my Notify and just used Condvar instead.
There is one part that seems odd to me: in `handle_start_replication`
there is a place where the previous code was doing a non-blocking read;
there is no TcpStream::try_read() so I fell back on manually flipping
the socket to non-blocking mode and then back again. This seems pretty
gross, but I'm not sure exactly what to replace this with: a background
thread? Extract the fd and run select() on it to first test if it's
readable?
Switch over to a newer version of rust-postgres PR752. A few
minor changes are required:
- PgLsn::UNDEFINED -> PgLsn::from(0)
- PgTimestamp -> SystemTime
Our builds can be a little inconsistent, because Cargo doesn't deal well
with workspaces where there are multiple crates which have different
dependencies that select different features. As a workaround, copy what
other big rust projects do: add a workspace_hack crate.
This crate just pins down a set of dependencies and features that
satisfies all of the workspace crates.
The benefits are:
- running `cargo build` from one of the workspace subdirectories now
works without rebuilding anything.
- running `cargo install` works (without rebuilding anything).
- making small dependency changes is much less likely to trigger large
dependency rebuilds.
A few things that Eric commented on at PR #96:
- Use thiserror to simplify the implemention of FilePathError
- Add unit tests
- Fix a few complaints from clippy
We should track the range of LSNs that are valid in a GetPage@LSN request
somehow, but currently this is just dead code. Remove, until we get around
to actually implement it.
https://github.com/zenithdb/zenith/issues/95 tracks that.
Currently, truncation is implemented in the RocksDB repository by storing
a special sentinel entry for each page that was truncated away. Hide that
implementation detail better in the abstract Repository interface, so
that caller doesn't need to construct the special sentinel WAL record.
While we're at it, refactor the CacheEntryContent struct to an enum.
This moves things around:
- The PageCache is split into two structs: Repository and Timeline. A
Repository holds multiple Timelines. In order to get a page version,
you must first get a reference to the Repository, then the Timeline
in the repository, and finally call the get_page_at_lsn() function
on the Timeline object. This sounds complicated, but because each
connection from a compute node, and each WAL receiver, only deals
with one timeline at a time, the callers can get the reference to
the Timeline object once and hold onto it. The Timeline corresponds
most closely to the old PageCache object.
- Repository and Timeline are now abstract traits, so that we can
support multiple implementations. I don't actually expect us to have
multiple implementations for long. We have the RocksDB
implementation now, but as soon as we have a different
implementation that's usable, I expect that we will retire the
RocksDB implementation. But I think this abstraction works as good
documentation in any case: it's now easier to see what the interface
for storing and loading pages from the repository is, by looking at
the Repository/Timeline traits. They abstract traits are in
repository.rs, and the RocksDB implementation of them is in
repository/rocksdb.rs.
- page_cache.rs is now a "switchboard" to get a handle to the
repository. Currently, the page server can only handle one
repository at a time, so there isn't much there, but in the future
we might do multi-tenancy there.
Mostly we're not testing python code, so verbose python tracebacks are
unhelpful. Add --tb=short to the pytest args to cut down on the noise.
To override this during testing, set the "extra_params" parameter on the
circleci job to "--tb=auto" or "--tb=long".
The local fork of rust-s3 has some code to support Google Cloud, but
that PR no longer applies upstream, and will need significant changes
before it can be re-submitted.
In the meantime, we might as well just use the most similar upstream
release. The benefit of switching is that it fixes a feature-resolution
bug that was causing us to build 24 more crates than needed (mostly
async-std and its dependencies).
The rust cache is growing dramatically. Change the cache key to start
over.
The weird "v98" was something I'd intended to reset before landing the
circleci config. Do the sane thing and start over at v01. The intent is
that we just increment the number each time something gets broken.
Since we are now calling the syscall directly, read_pidfile can now
parse an integer.
We also verify the pid is >= 1, because calling kill on 0 or negative
values goes straight to crazytown.
default(medium): 2 CPUs, 4GB RAM.
xlarge: 8 CPUs, 16GB RAM.
Some build jobs are getting killed with signal 9. I'm guessing that this
is probably an OOM condition...
I found I had a few other .zenith directories hanging around in odd
places. I doubt we intended those directories to collect in multiple
locations, so only hide the one in the git root directory.
If there isn't any version specified for a dependency crate, Cargo may
choose a newer version. This could happen when Cargo.lock is updated
("cargo update") but can also happen unexpectedly when adding or
changing other dependencies. This can allow API-breaking changes to be
picked up, breaking the build.
To prevent this, specify versions for all dependencies. Cargo is still
allowed to pick newer versions that are (hopefully) non-breaking, by
analyzing the semver version number.
There are two special cases here:
1. serde_derive::{Serialize, Deserialize} isn't really used any more. It
was only a separate crate in the past because of compiler limitations.
Nowadays, people turn on the "derive" feature of the serde crate and
use serde::{Serialize, Deserialize}.
2. parse_duration is unmaintained and has an open security issue. (gh
iss. 87) That issue probably isn't critical for us because of where we
use that crate, but it's probably still better to pin the version so we
can't get hit with an API-breaking change at an awkward time.
It's quite hard to get python2 to exit gracefully when the code was
intended for python3, because the interpreter will SyntaxError before
running a single line of code. Thankfully, the pytest developers put a
version check in their .ini config, so that should gracefully handle
both wrong-pytest-version and wrong-python-version.
Also document the woes of trying to run the pytest version shipped by
e.g. Debian or Ubuntu.
Fetching the postgres submodule is one of the more expensive steps of
the build. Doing a shallow clone ("--depth 1") should save some time and
a lot of network bandwidth.
This does the postgres & rust builds, caching the results, and preserves
its outputs in a "workspace" for downstream test jobs (which can run in
parallel).
Pytest jobs are parameterized, so adding new pytest-based tests requires
only adding a new job to the "workflows" section at the end.
This could use some optimization:
- The "apt-get install" step is quite slow.
- The rust build step will always happen, even if only unrelated changes
are present (e.g. modified a python test file)
- Saving/restoring the rust cache (/target) is very slow (it contains
1.3GB of data)
- Saving the workspace is very slow.
- The "install" step is ugly; postgres and rust artifacts could take a
much better form.
Use pytest to manage background services, paths, and environment
variables.
Benefits:
- Tests are a little easier to write.
- Cleanup is more reliable. You can CTRL-C a test and it will still shut
down gracefully. If you manually start a conflicting process, the test
fixtures will detect this and abort at startup.
- Don't need to worry about remembering '--test-threads=1'
- Output of sub-processes can be captured to files.
- Test fixtures configure everything to operate under a single test
output directory, making it easier to capture logs in CI.
- Detects all the necessary paths if run from the git root, but can also
run from arbitrary paths by setting environment variables.
There is also a deliberately broken test (test_broken.py) that can be
used to test whether the test fixtures properly clean up after
themselves. It won't run by default; the comment at the top explains how
to enable it.
Remove the check that enforces running from the git root directory.
Discover the zenith binary path from current_exe().
Look for postgres in $POSTGRES_BIN or $CWD/tmp_install.
Previously, 'zenith init' would initialize a PostgreSQL cluster with
"initdb -D tmp", creating the temp cluster under current directory.
It moves the 'tmp' directory under the correct snapshot directory in
the zenith repository after that, but if something goes wrong in initdb,
or in the steps that follow, it could leave behind the 'tmp' directory
under current dir. Better to create the temporary directory under the
repository directory to begin with, as ".zenith/tmp".
- Move notifications to a separate job, run only on push.
- Build and test will execute on [pull_request, push].
- Use actions-rs/toolchain@v1 to get the rust toolchain.
- Add matrix hook to allow multiple toolchain versions in the future
(now set to [stable]).
- Run all the cargo tests, not just test_pageserver
Make the caller of request_redo() responsible for gathering the WAL records
to redo, and for storing the reconstructed page image back in the page
cache. This leaves the WAL redo manager purely responsible for dealing with
the postgres child process, removing its dependency on the PageCache.
Having multiple copies of the same values is a source of confusion.
Commit da9bf5dc63 fixed one race condition caused by that, for example.
See also discussion at
https://github.com/zenithdb/zenith/issues/57#issuecomment-824393470
This changes SeqWait.advance() to return the old number, and not panic if
you try to move the value backwards. The caller should check for that and
act accordingly.
Remove 'async' usage a much as feasible. Async code is harder to debug,
and mixing async and non-async code is a recipe for confusion and bugs.
There are a couple of exceptions:
- The code in walredo.rs, which needs to read and write to the child
process simultaneously, still uses async. It's more convenient there.
The 'async' usage is carefully limited to just the functions that
communicate with the child process.
- Code in walreceiver.rs that uses tokio-postgres to do streaming
replication. We have to use async there, because tokio-postgres is
async. Most rust-postgres functionality has non-async wrappers, but
not the new replication client code. The async usage is very limited
here, too: we use just block_on to call the tokio-postgres functions.
The code in 'page_service.rs' now launches a dedicated thread for each
connection.
This replaces tokio::sync:⌚:channel with std::sync:mpsc in
'seqwait.rs', to make that non-async. It's not a drop-in replacement,
though: std::sync::mpsc doesn't support multiple consumers, so we cannot
share a channel between multiple waiters. So this removes the code to
check if an existing channel can be reused, and creates a new one for
each waiter. That created another problem: BTreeMap cannot hold
duplicates, so I replaced that with BinaryHeap.
Similarly, the tokio::{mpsc, oneshot} channels used between WAL redo
manager and PageCache are replaced with std::sync::mpsc. (There is no
separate 'oneshot' channel in the standard library.)
Fixes github issue #58, and coincidentally also issue #66.
AtomicLsn is a wrapper around AtomicU64 that has load() and store()
members that are cheap (on x86, anyway) and can be safely used in any
context.
This commit uses AtomicLsn in the page cache, and fixes up some
downstream code that manually implemented LSN formatting.
There's also a bugfix to the logging in wait_lsn, which prints the
wrong lsn value.
SeqWait can use any type that is Ord + Debug + Copy. Debug is not
strictly necessary, but allows us to keep the panic message if a caller
wants the sequence number to go backwards.
This type is a zero-cost wrapper for a u64, meant to help code
communicate with precision what that value means.
It implements Display and Debug. Display "{}" will format as
"1234ABCD:5678CDEF" while Debug will format as Lsn{1234567890}.
It was only marked as async because it calls relsize_get(), but
relsize_get() will in fact never block when it's called with the max
LSN value, like put_wal_record() does. Refactor to avoid marking
put_wal_record() as 'async'.
After the rocksdb patch (commit 6aa38d3f7d), the CacheEntry struct was
used only momentarily in the communication between the page_cache and
the walredo modules. It was in fact not stored in any cache anymore.
For clarity, refactor the communication.
There is now a WalRedoManager struct, with `request_redo` function,
that can be used to request WAL replay of a particular page. It sends
a request to a queue like before, but the queue has been replaced with
tokio::sync::mpsc. Previously, the resulting page image was stored
directly in the CacheEntry, and the requestor was notified using a
condition variable. Now, the requestor includes a 'oneshot' channel in
the request, and the WAL redo manager sends the response there.
Clippy pointed out that `drop(waiters)` didn't do anything, because
there was a misplaced ";" causing `waiters` to be a unit type `()`.
This change makes it do what was intended: the lock should be dropped
first, then the wakeups should be processed.
The comment was incorrect, claiming that ZTimelineId is a 32-byte value.
It is actually 16 bytes wide. While we're at it, improve the comment,
explaining what a zenith timeline is, and why it's different from
PostgreSQL timelines.
When calling into the page cache, it was possible to wait on a blocking
mutex, which can stall the async executor.
Replace that sleep with a SeqWait::wait_for(lsn).await so that the
executor can go on with other work while we wait.
Change walreceiver_works to an AtomicBool to avoid the awkwardness of
taking the lock, then dropping it while we call wait_for and then
acquiring it again to do real work.
SeqWait adds a way to .await the arrival of some sequence number.
It provides wait_for(num) which is an async fn, and advance(num) which
is synchronous.
This should be useful in solving the page cache deadlocks, and may be
useful in other areas too.
This implementation still uses a Mutex internally, but only for a brief
critical section. If we find this code broadly useful and start to care
more about executor stalls due to unfair thread scheduling, there might
be ways to make it lock-free.
- remove needless return
- remove needless format!
- remove a few more needless clone()
- from_str_radix(_, 10) -> .parse()
- remove needless reference
- remove needless `mut`
Also manually replaced a match statement with map_err() because after
clippy was done with it, there was almost nothing left in the match
expression.
# This file is only read when `yapf` is run from this directory.
# Hence we only top-level directories here to avoid confusion.
# See source code for the exact file format: https://github.com/google/yapf/blob/c6077954245bc3add82dafd853a1c7305a6ebd20/yapf/yapflib/file_resources.py#L40-L43
Zenith substitutes PostgreSQL storage layer and redistributes data across a cluster of nodes
Neon is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
The project used to be called "Zenith". Many of the commands and code comments
still refer to "zenith", but we are in the process of renaming things.
## Architecture overview
A Neon installation consists of compute nodes and Neon storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Neon storage engine.
Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
- WAL service. The service that receives WAL from compute node and ensures that it is stored durably.
Pageserver consists of:
- Repository - Neon storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
2. Start pageserver and postggres on top of it (should be called from repo root):
[Rust] 1.58 or later is also required.
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build neon and patched postgres
```sh
# Create ~/.zenith with proper paths to binaries and data
make# builds also postgres and installs it to ./tmp_install
./scripts/pytest
```
## Source tree layout
## Documentation
/walkeeper:
Now we use README files to cover design ideas and overall architecture for each module and `rustdoc` style documentation comments. See also [/docs/](/docs/) a top-level overview of all available markdown documentation.
WAL safekeeper. Written in Rust.
- [/docs/sourcetree.md](/docs/sourcetree.md) contains overview of source tree layout.
/pageserver:
To view your `rustdoc` documentation in a browser, try running `cargo doc --no-deps --open`
Page Server. Written in Rust.
### Postgres-specific terms
Depends on the modified 'postgres' binary for WAL redo.
Due to Neon's very close relation with PostgreSQL internals, there are numerous specific terms used.
Same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use.
/integration_tests:
Tests with different combinations of a Postgres compute node, WAL safekeeper and Page Server.
/mgmt-console:
Web UI to launch (modified) Postgres servers, using S3 as the backing store. Written in Python.
This is somewhat outdated, as it doesn't use the WAL safekeeper or Page Servers.
/vendor/postgres:
PostgreSQL source tree, with the modifications needed for Zenith.
/vendor/postgres/src/bin/safekeeper:
Extension (safekeeper_proxy) that runs in the compute node, and connects to the WAL safekeepers
Current state of authentication includes usage of JWT tokens in communication between compute and pageserver and between CLI and pageserver. JWT token is signed using RSA keys. CLI generates a key pair during call to `zenith init`. Using following openssl commands:
CLI also generates signed token and saves it in the config for later access to pageserver. Now authentication is optional. Pageserver has two variables in config: `auth_validation_public_key_path` and `auth_type`, so when auth type present and set to `ZenithJWT` pageserver will require authentication for connections. Actual JWT is passed in password field of connection string. There is a caveat for psql, it silently truncates passwords to 100 symbols, so to correctly pass JWT via psql you have to either use PGPASSWORD environment variable, or store password in psql config file.
Currently there is no authentication between compute and safekeepers, because this communication layer is under heavy refactoring. After this refactoring support for authentication will be added there too. Now safekeeper supports "hardcoded" token passed via environment variable to be able to use callmemaybe command in pageserver.
Compute uses token passed via environment variable to communicate to pageserver and in the future to the safekeeper too.
JWT authentication now supports two scopes: tenant and pageserverapi. Tenant scope is intended for use in tenant related api calls, e.g. create_branch. Compute launched for particular tenant also uses this scope. Scope pageserver api is intended to be used by console to manage pageserver. For now we have only one management operation - create tenant.
The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax.
To recap, the problem is that the XLOG_HEAP_INSERT record does not include the command id of the inserted row. And same with deletion/update. So in the primary, a row is inserted with current xmin + cmin. But in the replica, the cmin is always set to 1. That works, because the command id is only relevant to the inserting transaction itself. After commit/abort, no one cares abut it anymore.
- Alternatives?
I don't know
2. Add PD_WAL_LOGGED.
- Why?
Postgres sometimes writes data to the page before it is wal-logged. If such page ais swapped out, we will loose this change. The problem is currently solved by setting PD_WAL_LOGGED bit in page header. When page without this bit set is written to the SMGR, then it is forced to be written to the WAL as FPI using log_newpage_copy() function.
There was wrong assumption that it can happen only during construction of some exotic indexes (like gist). It is not true. The same situation can happen with COPY,VACUUM and when record hint bits are set.
Do not store this flag in page header, but associate this bit with shared buffer. Logically it is more correct but in practice we will get not advantages: neither in space, neither in CPU overhead.
3. XLogReadBufferForRedo not always loads and pins requested buffer. So we need to add extra checks that buffer is really pinned. Also do not use BufferGetBlockNumber for buffer returned by XLogReadBufferForRedo.
- Why?
XLogReadBufferForRedo is not pinning pages which are not requested by wal-redo. It is specific only for wal-redo Postgres.
- Alternatives?
No
4. Eliminate reporting of some warnings related with hint bits, for example
"page is not marked all-visible but visibility map bit is set in relation".
- Why?
Hint bit may be not WAL logged.
- Alternative?
Always wal log any page changes.
5. Maintain last written LSN.
- Why?
When compute node requests page from page server, we need to specify LSN. Ideally it should be LSN
of WAL record performing last update of this pages. But we do not know it, because we do not have page.
We can use current WAL flush position, but in this case there is high probability that page server
will be blocked until this peace of WAL is delivered.
As better approximation we can keep max LSN of written page. It will be better to take in account LSNs only of evicted pages,
but SMGR API doesn't provide such knowledge.
- Alternatives?
Maintain map of LSNs of evicted pages.
6. Launching Postgres without WAL.
- Why?
According to Zenith architecture compute node is stateless. So when we are launching
compute node, we need to provide some dummy PG_DATADIR. Relation pages
can be requested on demand from page server. But Postgres still need some non-relational data:
control and configuration files, SLRUs,...
It is currently implemented using basebackup (do not mix with pg_basebackup) which is created
by pageserver. It includes in this tarball config/control files, SLRUs and required directories.
As far as pageserver do not have original (non-scattered) WAL segments, it includes in
this tarball dummy WAL segment which contains only SHUTDOWN_CHECKPOINT record at the beginning of segment,
which redo field points to the end of wal. It allows to load checkpoint record in more or less
standard way with minimal changes of Postgres, but then some special handling is needed,
including restoring previous record position from zenith.signal file.
Also we have to correctly initialize header of last WAL page (pointed by checkpoint.redo)
to pass checks performed by XLogReader.
- Alternatives?
We may not include fake WAL segment in tarball at all and modify xlog.c to load checkpoint record
in special way. But it may only increase number of changes in xlog.c
7. Add redo_read_buffer_filter callback to XLogReadBufferForRedoExtended
- Why?
We need a way in wal-redo Postgres to ignore pages which are not requested by pageserver.
So wal-redo Postgres reconstructs only requested page and for all other returns BLK_DONE
which means that recovery for them is not needed.
- Alternatives?
No
8. Enforce WAL logging of sequence updates.
- Why?
Due to performance reasons Postgres don't want to log each fetching of a value from a sequence,
so we pre-log a few fetches in advance. In the event of crash we can lose
(skip over) as many values as we pre-logged.
But it doesn't work with Zenith because page with sequence value can be evicted from buffer cache
and we will get a gap in sequence values even without crash.
- Alternatives:
Do not try to preserve sequential order but avoid performance penalty.
9. Treat unlogged tables as normal (permanent) tables.
- Why?
Unlogged tables are not transient, so them have to survive node restart (unlike temporary tables).
But as far as compute node is stateless, we need to persist their data to storage node.
And it can only be done through the WAL.
- Alternatives?
* Store unlogged tables locally (violates requirement of stateless compute nodes).
* Prohibit unlogged tables at all.
10. Support start Postgres in wal-redo mode
- Why?
To be able to apply WAL record and reconstruct pages at page server.
- Alternatives?
* Rewrite redo handlers in Rust
* Do not reconstruct pages at page server at all and do it at compute node.
11. WAL proposer
- Why?
WAL proposer is communicating with safekeeper and ensures WAL durability by quorum writes.
It is currently implemented as patch to standard WAL sender.
- Alternatives?
Can be moved to extension if some extra callbacks will be added to wal sender code.
12. Secure Computing BPF API wrapper.
- Why?
Pageserver delegates complex WAL decoding duties to Postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.
- Alternatives:
* Rewrite redo handlers in Rust.
* Add more checks to guarantee correctness of WAL records.
* Move seccomp.c to extension
* Many other discussed approaches to neutralize incorrect WAL records vulnerabilities.
13. Callbacks for replica feedbacks
- Why?
Allowing waproposer to interact with walsender code.
- Alternatives
Copy walsender code to walproposer.
14. Support multiple SMGR implementations.
- Why?
Postgres provides abstract API for storage manager but it has only one implementation
and provides no way to replace it with custom storage manager.
- Alternatives?
None.
15. Calculate database size as sum of all database relations.
- Why?
Postgres is calculating database size by traversing data directory
but as far as Zenith compute node is stateless we can not do it.
- Alternatives?
Send this request directly to pageserver and calculate real (physical) size
of Zenith representation of database/timeline, rather than sum logical size of all relations.
-----------------------------------------------
Not currently committed but proposed:
1. Disable ring buffer buffer manager strategies
- Why?
Postgres tries to avoid cache flushing by bulk operations (copy, seqscan, vacuum,...).
Even if there are free space in buffer cache, pages may be evicted.
Negative effect of it can be somehow compensated by file system cache, but in case of Zenith
cost of requesting page from page server is much higher.
- Alternatives?
Instead of just prohibiting ring buffer we may try to implement more flexible eviction policy,
for example copy evicted page from ring buffer to some other buffer if there is free space
in buffer cache.
2. Disable marking page as dirty when hint bits are set.
- Why?
Postgres has to modify page twice: first time when some tuple is updated and second time when
hint bits are set. Wal logging hint bits updates requires FPI which significantly increase size of WAL.
- Alternatives?
Add special WAL record for setting page hints.
3. Prefetching
- Why?
As far as pages in Zenith are loaded on demand, to reduce node startup time
and also sppedup some massive queries we need some mechanism for bulk loading to
reduce page request round-trip overhead.
Currently Postgres is supporting prefetching only for bitmap scan.
In Zenith we also use prefetch for sequential and index scan. For sequential scan we prefetch
some number of following pages. For index scan we prefetch pages of heap relation addressed by TIDs.
4. Prewarming.
- Why?
Short downtime (or, in other words, fast compute node restart time) is one of the key feature of Zenith.
But overhead of request-response round-trip for loading pages on demand can make started node warm-up quite slow.
We can capture state of compute node buffer cache and send bulk request for this pages at startup.
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
Backpressure is used to limit the lag between pageserver and compute node or WAL service.
If compute node or WAL service run far ahead of Page Server,
the time of serving page requests increases. This may lead to timeout errors.
To tune backpressure limits use `max_replication_write_lag`, `max_replication_flush_lag` and `max_replication_apply_lag` settings.
When lag between current LSN (pg_current_wal_flush_lsn() at compute node) and minimal write/flush/apply position of replica exceeds the limit
backends performing writes are blocked until the replica is caught up.
### Base image (page image)
### Basebackup
A tarball with files needed to bootstrap a compute node[] and a corresponding command to create it.
NOTE:It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create branch at certain LSN using `zenith timeline branch` command.
Each Branch lives in a corresponding timeline[] and has an ancestor[].
### Checkpoint (PostgreSQL)
NOTE: This is an overloaded term.
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
### Checkpoint (Layered repository)
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server []
writes out the changes from the in-memory layer into a new delta layer file. This process
is called "checkpointing".
Configuration parameter `checkpoint_distance` defines the distance
from current LSN to perform checkpoint of in-memory layers.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
### Compaction
A background operation on layer files. Compaction takes a number of L0
layer files, each of which covers the whole key space and a range of
LSN, and reshuffles the data in them into L1 files so that each file
covers the whole LSN range, but only part of the key space.
Compaction should also opportunistically leave obsolete page versions
from the L1 files, and materialize other page versions for faster
access. That hasn't been implemented as of this writing, though.
### Compute node
Stateless Postgres node that stores data in pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
### Layer
A layer contains data needed to reconstruct any page versions within the
layer's Segment and range of LSNs.
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". There are two kinds of layer files:
image files and delta files. An image file contains a "snapshot" of a
range of keys at a particular LSN, and a delta file contains WAL
records applicable to a range of keys, in a range of LSNs.
### Layer map
The layer map tracks what layers exist in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
### LSN
The Log Sequence Number (LSN) is a unique identifier of the WAL record[] in the WAL log.
The insert position is a byte offset into the logs, increasing monotonically with each new record.
Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream.
It is printed as two hexadecimal numbers of up to 8 digits each, separated by a slash.
Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html)
Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
In postgres and Zenith lsns are used to describe certain points in WAL handling.
PostgreSQL LSNs and functions to monitor them:
*`pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location.
*`pg_current_wal_lsn()` - Returns the current write-ahead log write location.
*`pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
*`pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
*`pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the repository.
### PITR (Point-in-time-recovery)
PostgreSQL's ability to restore up to a specified LSN.
### Primary node
### Proxy
Postgres protocol proxy/router.
This service listens psql port, can check auth via external service
and create new databases and accounts (control plane API in our case).
### Relation
The generic term in PostgreSQL for all objects in a database that have a name and a list of attributes defined in a specific order.
### Replication slot
### Replica node
### Repository
Repository stores multiple timelines, forked off from the same initial call to 'initdb'
and has associated WAL redo service.
One repository corresponds to one Tenant.
### Retention policy
How much history do we need to keep around for PITR and read-only nodes?
### Segment
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### SLRU
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in zenith yet (pg_commit_ts).
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
Wal redo[] activity, timelines[], layers[] are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper
See `docs/multitenancy.md` for more.
### Timeline
Timeline accepts page changes and serves get_page_at_lsn() and
get_rel_size() requests. The term "timeline" is used internally
in the system, but to users they are exposed as "branches", with
human-friendly names.
NOTE: this has nothing to do with PostgreSQL WAL timelines.
### XLOG
PostgreSQL alias for WAL[].
### WAL (Write-ahead log)
The journal that keeps track of the changes in the database cluster as user- and system-invoked operations take place. It comprises many individual WAL records[] written sequentially to WAL files[].
### WAL acceptor, WAL proposer
In the context of the consensus algorithm, the Postgres
compute node is also known as the WAL proposer, and the safekeeper is also known
as the acceptor. Those are the standard terms in the Paxos algorithm.
### WAL receiver (WAL decoder)
The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records,
and stores them to the repository.
We keep one WAL receiver active per timeline.
### WAL record
A low-level description of an individual data change.
### WAL redo
A service that runs PostgreSQL in a special wal_redo mode
to apply given WAL records over an old page image and return new page image.
### WAL safekeeper
One node that participates in the quorum. All the safekeepers
together form the WAL service.
### WAL segment (WAL file)
Also known as WAL segment or WAL segment file. Each of the sequentially-numbered files that provide storage space for WAL. The files are all of the same predefined size and are written in sequential order, interspersing changes as they occur in multiple simultaneous sessions.
### WAL service
The service as whole that ensures that WAL is stored durably.
Zenith supports multitenancy. One pageserver can serve multiple tenants at once. Tenants can be managed via zenith CLI. During page server setup tenant can be created using ```zenith init --create-tenant``` Also tenants can be added into the system on the fly without pageserver restart. This can be done using the following cli command: ```zenith tenant create``` Tenants use random identifiers which can be represented as a 32 symbols hexadecimal string. So zenith tenant create accepts desired tenant id as an optional argument. The concept of timelines/branches is working independently per tenant.
### Tenants in other commands
By default during `zenith init` new tenant is created on the pageserver. Newly created tenant's id is saved to cli config, so other commands can use it automatically if no direct arugment `--tenantid=<tenantid>` is provided. So generally tenantid more frequently appears in internal pageserver interface. Its commands take tenantid argument to distinguish to which tenant operation should be applied. CLI support creation of new tenants.
On the page server tenants introduce one level of indirection, so data directory structured the following way:
```
<pageserver working directory>
├── pageserver.log
├── pageserver.pid
├── pageserver.toml
└── tenants
├── 537cffa58a4fa557e49e19951b5a9d6b
├── de182bc61fb11a5a6b390a8aed3a804a
└── ee6016ec31116c1b7c33dfdfca38891f
```
Wal redo activity and timelines are managed for each tenant independently.
For local environment used for example in tests there also new level of indirection for tenants. It touches `pgdatadirs` directory. Now it contains `tenants` subdirectory so the structure looks the following way:
```
pgdatadirs
└── tenants
├── de182bc61fb11a5a6b390a8aed3a804a
│ └── main
└── ee6016ec31116c1b7c33dfdfca38892f
└── main
```
### Changes to postgres
Tenant id is passed to postgres via GUC the same way as the timeline. Tenant id is added to commands issued to pageserver, namely: pagestream, callmemaybe. Tenant id is also exists in ServerInfo structure, this is needed to pass the value to wal receiver to be able to forward it to the pageserver.
### Safety
For now particular tenant can only appear on a particular pageserver. Set of safekeepers are also pinned to particular (tenantid, timeline) pair so there can only be one writer for particular (tenantid, timeline).
This feature allows to migrate a timeline from one pageserver to another by utilizing remote storage capability.
### Migration process
Pageserver implements two new http handlers: timeline attach and timeline detach.
Timeline migration is performed in a following way:
1. Timeline attach is called on a target pageserver. This asks pageserver to download latest checkpoint uploaded to s3.
2. For now it is necessary to manually initialize replication stream via callmemaybe call so target pageserver initializes replication from safekeeper (it is desired to avoid this and initialize replication directly in attach handler, but this requires some refactoring (probably [#997](https://github.com/zenithdb/zenith/issues/997)/[#1049](https://github.com/zenithdb/zenith/issues/1049))
3. Replication state can be tracked via timeline detail pageserver call.
4. Compute node should be restarted with new pageserver connection string. Issue with multiple compute nodes for one timeline is handled on the safekeeper consensus level. So this is not a problem here.Currently responsibility for rescheduling the compute with updated config lies on external coordinator (console).
5. Timeline is detached from old pageserver. On disk data is removed.
### Implementation details
Now safekeeper needs to track which pageserver it is replicating to. This introduces complications into replication code:
* We need to distinguish different pageservers (now this is done by connection string which is imperfect and is covered here: https://github.com/zenithdb/zenith/issues/1105). Callmemaybe subscription management also needs to track that (this is already implemented).
* We need to track which pageserver is the primary. This is needed to avoid reconnections to non primary pageservers. Because we shouldn't reconnect to them when they decide to stop their walreceiver. I e this can appear when there is a load on the compute and we are trying to detach timeline from old pageserver. In this case callmemaybe will try to reconnect to it because replication termination condition is not met (page server with active compute could never catch up to the latest lsn, so there is always some wal tail)
Simplify storage operations for people => Gain adoption/installs on laptops and small private installation => Attract customers to DBaaS by seamless integration between our tooling and cloud.
Proposed architecture addresses:
- High availability -- tolerates n/2 - 1 failures
- Multi-tenancy -- one storage for all databases
- Elasticity -- increase storage size on the go by adding nodes
- Snapshots / backups / PITR with S3 offload
- Compression
Minuses are:
- Quite a lot of work
- Single page access may touch few disk pages
- Some bloat in data — may slowdown sequential scans
## **Summary**
Storage cluster is sharded key-value store with ordered keys. Key (****page_key****) is a tuple of `(pg_id, db_id, timeline_id, rel_id, forkno, segno, pageno, lsn)`. Value is either page or page diff/wal. Each chunk (chunk == shard) stores approx 50-100GB ~~and automatically splits in half when grows bigger then soft 100GB limit~~. by having a fixed range of pageno's it is responsible for. Chunks placement on storage nodes is stored in a separate metadata service, so chunk can be freely moved around the cluster if it is need. Chunk itself is a filesystem directory with following sub directories:
```
|-chunk_42/
|-store/ -- contains lsm with pages/pagediffs ranging from
| page_key_lo to page_key_hi
|-wal/
| |- db_1234/ db-specific wal files with pages from page_key_lo
| to page_key_hi
|
|-chunk.meta -- small file with snapshot references
(page_key_prefix+lsn+name)
and PITR regions (page_key_start, page_key_end)
```
## **Chunk**
Chunk is responsible for storing pages potentially from different databases and relations. Each page is addressed by a lexicographically ordered tuple (****page_key****) with following fields:
-`pg_id` -- unique id of given postgres instance (or postgres cluster as it is called in postgres docs)
-`db_id` -- database that was created by 'CREATE DATABASE' in a given postgres instance
-`db_timeline` -- used to create Copy-on-Write instances from snapshots, described later
-`rel_id` -- tuple of (relation_id, 0) for tables and (indexed_relation_id, rel_id) for indices. Done this way so table indices were closer to table itself on our global key space.
-`(forkno, segno, pageno)` -- page coordinates in postgres data files
-`lsn_timeline` -- postgres feature, increments when PITR was done.
-`lsn` -- lsn of current page version.
Chunk stores pages and page diffs ranging from page_key_lo to page_key_hi. Processing node looks at page in wal record and sends record to a chunk responsible for this page range. When wal record arrives to a chunk it is initially stored in `chunk_id/wal/db_id/wal_segno.wal`. Then background process moves records from that wal files to the lsm tree in `chunk_id/store`. Or, more precisely, wal records would be materialized into lsm memtable and when that memtable is flushed to SSTable on disk we may trim the wal. That way some not durably (in the distributed sense) committed pages may enter the tree -- here we rely on processing node behavior: page request from processing node should contain proper lsm horizons so that storage node may respond with proper page version.
LSM here is a usual LSM for variable-length values: at first data is stored in memory (we hold incoming wal records to be able to regenerate it after restart) at some balanced tree. When this tree grows big enough we dump it into disk file (SSTable) sorting records by key. Then SStables are mergesorted in the background to a different files. All file operation are sequential and do not require WAL for durability.
So query for `pageno=42 up to lsn=260` would need to find closest entry less then this key, iterate back to the latest full page and iterate forward to apply diffs. How often page is materialized in lsn-version sequence is up to us -- let's say each 5th version should be a full page.
### **Page deletion**
To delete old pages we insert blind deletion marker `(pg_id, db_id, #trim_lsn < 150)` into a lsm tree. During merges such marker would indicate that all pages with smaller lsn should be discarded. Delete marker will travel down the tree levels hierarchy until it reaches last level. In non-PITR scenario where old page version are not needed at all such deletion marker would (in average) prevent old page versions propagation down the tree -- so all bloat would concentrate at higher tree layers without affecting bigger bottom layers.
### **Recovery**
Upon storage node restart recent WAL files are applied to appropriate pages and resulting pages stored in lsm memtable. So this should be fast since we are not writing anything to disk.
### **Checkpointing**
No such mechanism is needed. Or we may look at the storage node as at kind of continuous chekpointer.
### **Full page writes (torn page protection)**
Storage node never updates individual pages, only merges SSTable, so torn pages is not an issue.
### **Snapshot**
That is the part that I like about this design -- snapshot creation is instant and cheap operation that can have flexible granularity level: whole instance, database, table. Snapshot creation inserts a record in `chunk.meta` file with lsn of this snapshot and key prefix `(pg_id, db_id, db_timeline, rel_id, *)` that prohibits pages deletion within this range. Storage node may not know anything about page internals, but by changing number of fields in our prefix we may change snapshot granularity.
It is again useful to remap `rel_id` to `(indexed_relation_id, rel_id)` so that snapshot of relation would include it's indices. Also table snapshot would trickily interact with catalog. Probably all table snapshots should hold also a catalog snapshot. And when node is started with such snapshot it should check that only tables from snapshot are queried. I assume here that for snapshot reading one need to start a new postgres instance.
Storage consumed by snapshot is proportional to the amount of data changed. We may have some heuristic (calculated based on cost of different storages) about when to offload old snapshot to s3. For example, if current database has more then 40% of changed pages with respect to previous snapshot then we may offload that snapshot to s3, and release this space.
**Starting db from snapshot**
When we are starting database from snapshot it can be done in two ways. First, we may create new db_id, move all the data from snapshot to a new db and start a database. Second option is to create Copy-on-Write (CoW) instance out of snapshot and read old pages from old snapshot and store new pages separately. That is why there is `db_timeline` key field near `db_id` -- CoW (🐮) database should create new `db_timeline` and remember old `db_timeline`. Such a database can have hashmap of pages that it is changed to query pages from proper snapshot on the first try. `db_timeline` is located near `db_id` so that new page versions generated by new instance would not bloat data of initial snapshot. It is not clear for whether it is possibly to effectively support "stacked" CoW snapshot, so we may disallow them. (Well, one way to support them is to move `db_timeline` close to `lsn` -- so we may scan neighboring pages and find right one. But again that way we bloat snapshot with unrelated data and may slowdown full scans that are happening in different database).
**Snapshot export/import**
Once we may start CoW instances it is easy to run auxiliary postgres instance on this snapshot and run `COPY FROM (...) TO stdout` or `pg_dump` and export data from the snapshot to some portable formats. Also we may start postgres on a new empty database and run `COPY FROM stdin`. This way we can initialize new non-CoW databases and transfer snapshots via network.
### **PITR area**
In described scheme PITR is just a prohibition to delete any versions within some key prefix, either it is a database or a table key prefix. So PITR may have different settings for different tables, databases, etc.
PITR is quite bloaty, so we may aggressively offload it to s3 -- we may push same (or bigger) SSTables to s3 and maintain lsm structure there.
### **Compression**
Since we are storing page diffs of variable sizes there is no structural dependency on a page size and we may compress it. Again that could be enabled only on pages with some key prefixes, so we may have this with db/table granularity.
### **Chunk metadata**
Chunk metadata is a file lies in chunk directory that stores info about current snapshots and PITR regions. Chunck should always consult this data when merging SSTables and applying delete markers.
### **Chunk splitting**
*(NB: following paragraph is about how to avoid page splitting)*
When chunks hits some soft storage limit (let's say 100Gb) it should be split in half and global matadata about chunk boundaries should be updated. Here i assume that chunk split is a local operation happening on single node. Process of chink splitting should look like following:
1. Find separation key and spawn two new chunks with [lo, mid) [mid, hi) boundaries.
2. Prohibit WAL deletion and old SSTables deletion on original chunk.
3. On each lsm layer we would need to split only one SSTable, all other would fit within left or right range. Symlink/split that files to new chunks.
4. Start WAL replay on new chunks.
5. Update global metadata about new chunk boundaries.
6. Eventually (metadata update should be pushed to processing node by metadata service) storage node will start sending WAL and page requests to the new nodes.
7. New chunk may start serving read queries when following conditions are met:
a) it receives at least on WAL record from processing node
b) it replayed all WAL up to the new received one
c) checked by downlinks that there were no WAL gaps.
Chunk split as it is described here is quite fast operation when it is happening on the local disk -- vast majority of files will be just moved without copying anything. I suggest to keep split always local and not to mix it with chunk moving around cluster. So if we want to split some chunk but there is small amount of free space left on the device, we should first move some chunks away from the node and then proceed with splitting.
### Fixed chunks
Alternative strategy is to not to split at all and have pageno-fixed chunk boundaries. When table is created we first materialize this chunk by storing first new pages only and chunks is small. Then chunk is growing while table is filled, but it can't grow substantially bigger then allowed pageno range, so at max it would be 1GB or whatever limit we want + some bloat due to snapshots and old page versions.
### **Chunk lsm internals**
So how to implement chunk's lsm?
- Write from scratch and use RocksDB to prototype/benchmark, then switch to own lsm implementation. RocksDB can provide some sanity check for performance of home-brewed implementation and it would be easier to prototype.
- Use postgres as lego constructor. We may model memtable with postgres B-tree referencing some in-memory log of incoming records. SSTable merging may reuse postgres external merging algorithm, etc. One thing that would definitely not fit (or I didn't came up with idea how to fit that) -- is multi-tenancy. If we are storing pages from different databases we can't use postgres buffer pool, since there is no db_id in the page header. We can add new field there but IMO it would be no go for committing that to vanilla.
Other possibility is to not to try to fit few databases in one storage node. But that way it is no go for multi-tenant cloud installation: we would need to run a lot of storage node instances on one physical storage node, all with it own local page cache. So that would be much closer to ordinary managed RDS.
Multi-tenant storage makes sense even on a laptop, when you work with different databases, running tests with temp database, etc. And when installation grows bigger it start to make more and more sense, so it seems important.
# Storage fleet
# **Storage fleet**
- When database is smaller then a chunk size we naturally can store them in one chunk (since their page_key would fit in some chunk's [hi, lo) range).
Few databases are stored in one chunk, replicated three times
- When database can't fit into one storage node it can occupy lots of chunks that were split while database was growing. Chunk placement on nodes is controlled by us with some automatization, but we alway may manually move chunks around the cluster.
Here one big database occupies two set of nodes. Also some chunks were moved around to restore replication factor after disk failure. In this case we also have "sharded" storage for a big database and issue wal writes to different chunks in parallel.
## **Chunk placement strategies**
There are few scenarios where we may want to move chunks around the cluster:
- disk usage on some node is big
- some disk experienced a failure
- some node experienced a failure or need maintenance
## **Chunk replication**
Chunk replication may be done by cloning page ranges with respect to some lsn from peer nodes, updating global metadata, waiting for WAL to come, replaying previous WAL and becoming online -- more or less like during chunk split.
Zenith CLI as it is described here mostly resides on the same conceptual level as pg_ctl/initdb/pg_recvxlog/etc and replaces some of them in an opinionated way. I would also suggest bundling our patched postgres inside zenith distribution at least at the start.
This proposal is focused on managing local installations. For cluster operations, different tooling would be needed. The point of integration between the two is storage URL: no matter how complex cluster setup is it may provide an endpoint where the user may push snapshots.
The most important concept here is a snapshot, which can be created/pushed/pulled/exported. Also, we may start temporary read-only postgres instance over any local snapshot. A more complex scenario would consist of several basic operations over snapshots.
## Pull snapshot with some publicly shared database
Since we may export the whole snapshot as one big file (tar of basebackup, maybe with some manifest) it may be shared over conventional means: http, ssh, [git+lfs](https://docs.github.com/en/github/managing-large-files/about-git-large-file-storage).
```
> zenith pg create --snapshot http://learn-postgres.com/movies_db.zenith movies
One way to rollback the database is just to init a new database from the snapshot and destroy the old one. But creating a new database from a snapshot would require a copy of that snapshot which is time consuming operation. Another option that would be cool to support is the ability to create the copy-on-write database from the snapshot without copying data, and store updated pages in a separate location, however that way would have performance implications. So to properly rollback the database to the older state we have `zenith pg checkout`.
```
> zenith pg list
ID PGDATA USED STORAGE ENDPOINT
primary1 pgdata1 5G zenith-local localhost:5432
> zenith snapshot create pgdata1@snap1
> zenith snapshot list
ID SIZE PARENT
oldpg 5G -
pgdata1@snap1 6G -
pgdata1@CURRENT 6G -
> zenith pg checkout pgdata1@snap1
Stopping postgres on pgdata1.
Rolling back pgdata1@CURRENT to pgdata1@snap1.
Starting postgres on pgdata1.
> zenith snapshot list
ID SIZE PARENT
oldpg 5G -
pgdata1@snap1 6G -
pgdata1@HEAD{0} 6G -
pgdata1@CURRENT 6G -
```
Some notes: pgdata1@CURRENT -- implicit snapshot representing the current state of the database in the data directory. When we are checking out some snapshot CURRENT will be set to this snapshot and the old CURRENT state will be named HEAD{0} (0 is the number of postgres timeline, it would be incremented after each such checkout).
## Configure PITR area (Point In Time Recovery).
PITR area acts like a continuous snapshot where you can reset the database to any point in time within this area (by area I mean some TTL period or some size limit, both possibly infinite).
Resetting the database to some state in past would require creating a snapshot on some lsn / time in this pirt area.
# Manual
## storage
Storage is either zenith pagestore or s3. Users may create a database in a pagestore and create/move *snapshots* and *pitr regions* in both pagestore and s3. Storage is a concept similar to `git remote`. After installation, I imagine one local storage is available by default.
**zenith storage attach** -t [native|s3] -c key=value -n name
Attaches/initializes storage. For --type=s3, user credentials and path should be provided. For --type=native we may support --path=/local/path and --url=zenith.tech/stas/mystore. Other possible term for native is 'zstore'.
Manages postgres data directories and can start postgreses with proper configuration. An experienced user may avoid using that (except pg create) and configure/run postgres by themself.
Pg is a term for a single postgres running on some data. I'm trying to avoid here separation of datadir management and postgres instance management -- both that concepts bundled here together.
Creates (initializes) new data directory in given storage and starts postgres. I imagine that storage for this operation may be only local and data movement to remote location happens through snapshots/pitr.
--no-start: just init datadir without creating
--snapshot snap: init from the snapshot. Snap is a name or URL (zenith.tech/stas/mystore/snap1)
--cow: initialize Copy-on-Write data directory on top of some snapshot (makes sense if it is a snapshot of currently running a database)
**zenith pg destroy**
**zenith pg start** [--replica] pgdata
Start postgres with proper extensions preloaded/installed.
**zenith pg checkout**
Rollback data directory to some previous snapshot.
**zenith pg stop** pg_id
**zenith pg list**
```
ROLE PGDATA USED STORAGE ENDPOINT
primary my_pg 5.1G local localhost:5432
replica-1 localhost:5433
replica-2 localhost:5434
primary my_pg2 3.2G local.compr localhost:5435
- my_pg3 9.2G local.compr -
```
**zenith pg show**
```
my_pg:
storage: local
space used on local: 5.1G
space used on all storages: 15.1G
snapshots:
on local:
snap1: 1G
snap2: 1G
on zcloud:
snap2: 1G
on s3tank:
snap5: 2G
pitr:
on s3tank:
pitr_one_month: 45G
```
**zenith pg start-rest/graphql** pgdata
Starts REST/GraphQL proxy on top of postgres master. Not sure we should do that, just an idea.
## snapshot
Snapshot creation is cheap -- no actual data is copied, we just start retaining old pages. Snapshot size means the amount of retained data, not all data. Snapshot name looks like pgdata_name@tag_name. tag_name is set by the user during snapshot creation. There are some reserved tag names: CURRENT represents the current state of the data directory; HEAD{i} represents the data directory state that resided in the database before i-th checkout.
**zenith snapshot create** pgdata_name@snap_name
Creates a new snapshot in the same storage where pgdata_name exists.
Produces binary stream of a given snapshot. Under the hood starts temp read-only postgres over this snapshot and sends basebackup stream. Receiving side should start `zenith snapshot recv` before push happens. If url has some special schema like zenith:// receiving side may require auth start `zenith snapshot recv` on the go.
**zenith snapshot recv**
Starts a port listening for a basebackup stream, prints connection info to stdout (so that user may use that in push command), and expects data on that socket.
**zenith snapshot pull** --from url or path
Connects to a remote zenith/s3/file and pulls snapshot. The remote site should be zenith service or files in our format.
**zenith snapshot import** --from basebackup://<...> or path
Creates a new snapshot out of running postgres via basebackup protocol or basebackup files.
**zenith snapshot export**
Starts read-only postgres over this snapshot and exports data in some format (pg_dump, or COPY TO on some/all tables). One of the options may be zenith own format which is handy for us (but I think just tar of basebackup would be okay).
**zenith snapshot diff** snap1 snap2
Shows size of data changed between two snapshots. We also may provide options to diff schema/data in tables. To do that start temp read-only postgreses.
**zenith snapshot destroy**
## pitr
Pitr represents wal stream and ttl policy for that stream
Here I list some objectives to keep in mind when discussing zenith-local design and a proposal that brings all components together. Your comments on both parts are very welcome.
#### Why do we need it?
- For distribution - this easy to use binary will help us to build adoption among developers.
- For internal use - to test all components together.
In my understanding, we consider it to be just a mock-up version of zenith-cloud.
> Question: How much should we care about durability and security issues for a local setup?
#### Why is it better than a simple local postgres?
- Easy one-line setup. As simple as `cargo install zenith && zenith start`
- Quick and cheap creation of compute nodes over the same storage.
> Question: How can we describe a use-case for this feature?
- Zenith-local can work with S3 directly.
- Push and pull images (snapshots) to remote S3 to exchange data with other users.
- Quick and cheap snapshot checkouts to switch back and forth in the database history.
> Question: Do we want it in the very first release? This feature seems quite complicated.
#### Distribution:
Ideally, just one binary that incorporates all elements we need.
> Question: Let's discuss pros and cons of having a separate package with modified PostgreSQL.
#### Components:
- **zenith-CLI** - interface for end-users. Turns commands to REST requests and handles responces to show them in a user-friendly way.
CLI proposal is here https://github.com/libzenith/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md
WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src/bin/cli
- **zenith-console** - WEB UI with same functionality as CLI.
>Note: not for the first release.
- **zenith-local** - entrypoint. Service that starts all other components and handles REST API requests. See REST API proposal below.
> Idea: spawn all other components as child processes, so that we could shutdown everything by stopping zenith-local.
- **zenith-pageserver** - consists of a storage and WAL-replaying service (modified PG in current implementation).
> Question: Probably, for local setup we should be able to bypass page-storage and interact directly with S3 to avoid double caching in shared buffers and page-server?
WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src
- **zenith-S3** - stores base images of the database and WAL in S3 object storage. Import and export images from/to zenith.
> Question: How should it operate in a local setup? Will we manage it ourselves or ask user to provide credentials for existing S3 object storage (i.e. minio)?
> Question: Do we use it together with local page store or they are interchangeable?
WIP code is ???
- **zenith-safekeeper** - receives WAL from postgres, stores it durably, answers to Postgres that "sync" is succeed.
> Question: How should it operate in a local setup? In my understanding it should push WAL directly to S3 (if we use it) or store all data locally (if we use local page storage). The latter option seems meaningless (extra overhead and no gain), but it is still good to test the system.
WIP code is here: https://github.com/libzenith/postgres/tree/main/src/bin/safekeeper
- **zenith-computenode** - bottomless PostgreSQL, ideally upstream, but for a start - our modified version. User can quickly create and destroy them and work with it as a regular postgres database.
WIP code is in main branch and here: https://github.com/libzenith/postgres/commits/compute_node
#### REST API:
Service endpoint: `http://localhost:3000`
Resources:
- /storages - Where data lives: zenith-pageserver or zenith-s3
- /pgs - Postgres - zenith-computenode
- /snapshots - snapshots **TODO**
>Question: Do we want to extend this API to manage zenith components? I.e. start page-server, manage safekeepers and so on? Or they will be hardcoded to just start once and for all?
Methods and their mapping to CLI:
- /storages - zenith-pageserver or zenith-s3
CLI | REST API
------------- | -------------
storage attach -n name --type [native\s3] --path=[datadir\URL] | PUT -d { "name": "name", "type": "native", "path": "/tmp" } /storages
storage detach -n name | DELETE /storages/:storage_name
storage list | GET /storages
storage show -n name | GET /storages/:storage_name
- /pgs - zenith-computenode
CLI | REST API
------------- | -------------
pg create -n name --s storage_name | PUT -d { "name": "name", "storage_name": "storage_name" } /pgs
pg destroy -n name | DELETE /pgs/:pg_name
pg start -n name --replica | POST -d {"action": "start", "is_replica":"replica"} /pgs/:pg_name /actions
pg stop -n name | POST -d {"action": "stop"} /pgs/:pg_name /actions
pg promote -n name | POST -d {"action": "promote"} /pgs/:pg_name /actions
Zenith CLI allows you to operate database clusters (catalog clusters) and their commit history locally and in the cloud. Since ANSI calls them catalog clusters and cluster is a loaded term in the modern infrastructure we will call it "catalog".
# CLI v2 (after chatting with Carl)
Zenith introduces the notion of a repository.
```bash
zenith init
zenith clone zenith://zenith.tech/piedpiper/northwind -- clones a repo to the northwind directory
```
Once you have a cluster catalog you can explore it
```bash
zenith log -- returns a list of commits
zenith status -- returns if there are changes in the catalog that can be committed
zenith commit -- commits the changes and generates a new commit hash
zenith branch experimental <hash> -- creates a branch called testdb based on a given commit hash
```
To make changes in the catalog you need to run compute nodes
```bash
-- here is how you a compute node
zenith start /home/pipedpiper/northwind:main -- starts a compute instance
zenith start zenith://zenith.tech/northwind:main -- starts a compute instance in the cloud
-- you can start a compute node against any hash or branch
zenith start /home/pipedpiper/northwind:experimental --port 8008 -- start anothe compute instance (on different port)
-- you can start a compute node against any hash or branch
zenith start /home/pipedpiper/northwind:<hash> --port 8009 -- start anothe compute instance (on different port)
-- After running some DML you can run
-- zenith status and see how there are two WAL streams one on top of
Here is a proposal about implementing push/pull mechanics between pageservers. We also want to be able to push/pull to S3 but that would depend on the exact storage format so we don't touch that in this proposal.
## Origin management
The origin represents connection info for some remote pageserver. Let's use here same commands as git uses except using explicit list subcommand (git uses `origin -v` for that).
```
zenith origin add <name> <connection_uri>
zenith origin list
zenith origin remove <name>
```
Connection URI a string of form `postgresql://user:pass@hostname:port` (https://www.postgresql.org/docs/13/libpq-connect.html#id-1.7.3.8.3.6). We can start with libpq password auth and later add support for client certs or require ssh as transport or invent some other kind of transport.
Behind the scenes, this commands may update toml file inside .zenith directory.
## Push
### Pushing branch
```
zenith push mybranch cloudserver # push to eponymous branch in cloudserver
zenith push mybranch cloudserver:otherbranch # push to a different branch in cloudserver
```
Exact mechanics would be slightly different in the following situations:
1) Destination branch does not exist.
That is the simplest scenario. We can just create an empty branch (or timeline in internal terminology) and transfer all the pages/records that we have in our timeline. Right now each timeline is quite independent of other timelines so I suggest skipping any checks that there is a common ancestor and just fill it with data. Later when CoW timelines will land to the pageserver we may add that check and decide whether this timeline belongs to this pageserver repository or not [*].
The exact mechanics may be the following:
* CLI asks local pageserver to perform push and hands over connection uri: `perform_push <branch_name> <uri>`.
* local pageserver connects to the remote pageserver and runs `branch_push <branch_name> <timetine_id>`
Handler for branch_create would create destination timeline and switch connection to copyboth mode.
* Sending pageserver may start iterator on that timeline and send all the records as copy messages.
2) Destination branch exists and latest_valid_lsn is less than ours.
In this case, we need to send missing records. To do that we need to find all pages that were changed since that remote LSN. Right now we don't have any tracking mechanism for that, so let's just iterate over all records and send ones that are newer than remote LSN. Later we probably should add a sparse bitmap that would track changed pages to avoid full scan.
3) Destination branch exists and latest_valid_lsn is bigger than ours.
In this case, we can't push to that branch. We can only pull.
### Pulling branch
Here we need to handle the same three cases, but also keep in mind that local pageserver can be behind NAT and we can't trivially re-use pushing by asking remote to 'perform_push' to our address. So we would need a new set of commands:
* CLI calls `perform_pull <branch_name> <uri>` on local pageserver.
* local pageserver calls `branch_pull <branch_name> <timetine_id>` on remote pageserver.
* remote pageserver sends records in our direction
But despite the different set of commands code that performs iteration over records and receiving code that inserts that records can be the same for both pull and push.
[*] It looks to me that there are two different possible approaches to handling unrelated timelines:
1) Allow storing unrelated timelines in one repo. Some timelines may have parents and some may not.
2) Transparently create and manage several repositories in one pageserver.
But that is the topic for a separate RFC/discussion.
While working on export/import commands, I understood that they fit really well into "snapshot-first design".
We may think about backups as snapshots in a different format (i.e plain pgdata format, basebackup tar format, WAL-G format (if they want to support it) and so on). They use same storage API, the only difference is the code that packs/unpacks files.
Even if zenith aims to maintains durability using it's own snapshots, backups will be useful for uploading data from postges to zenith.
So here is an attemt to design consistent CLI for diferent usage scenarios:
#### 1. Start empty pageserver.
That is what we have now.
Init empty pageserver using `initdb` in temporary directory.
`--storage_dest=FILE_PREFIX | S3_PREFIX |...` option defines object storage type, all other parameters are passed via env variables. Inspired by WAL-G style naming : https://wal-g.readthedocs.io/STORAGES/.
Save`storage_dest` and other parameters in config.
Push snapshots to `storage_dest` in background.
```
zenith init --storage_dest=S3_PREFIX
zenith start
```
#### 2. Restart pageserver (manually or crash-recovery).
Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`.
Push snapshots to `storage_dest` in background.
```
zenith start
```
#### 3. Import.
Start pageserver from existing snapshot.
Path to snapshot provided via `--snapshot_path=FILE_PREFIX | S3_PREFIX | ...`
Do not save `snapshot_path` and `snapshot_format` in config, as it is a one-time operation.
Save`storage_dest` parameters in config.
Push snapshots to `storage_dest` in background.
```
//I.e. we want to start zenith on top of existing $PGDATA and use s3 as a persistent storage.
- Easy snapshots; simple snapshot and branch management.
- Allow cloud-based snapshot/branch management.
- Allow cloud-centric branching; decouple branch state from running pageserver.
- Allow customer ownership of data via s3 permissions.
- Provide same or better performance for typical workloads, vs plain postgres.
Non-goals:
- Service database reads from s3 (reads should be serviced from the pageserver cache).
- Keep every version of every page / Implement point-in-time recovery (possibly a future paid feature, based on WAL replay from an existing snapshot).
## Principle of operation
The database “lives in s3”. This means that all of the long term page storage is in s3, and the “live database”-- the version that lives in the pageserver-- is a set of “dirty pages” that haven’t yet been written back to s3.
In practice, this is mostly similar to storing frequent snapshots to s3 of a database that lives primarily elsewhere.
The main difference is that s3 is authoritative about which branches exist; pageservers consume branches, snapshots, and related metadata by reading them from s3. This allows cloud-based management of branches and snapshots, regardless of whether a pageserver is running or not.
It’s expected that a pageserver should keep a copy of all pages, to shield users from s3 latency. A cheap/slow pageserver that falls back to s3 for some reads would be possible, but doesn’t seem very useful right now.
Because s3 keeps all history, and the safekeeper(s) preserve any WAL records needed to reconstruct the most recent changes, the pageserver can store dirty pages in RAM or using non-durable local storage; this should allow very good write performance, since there is no need for fsync or journaling.
Objects in s3 are immutable snapshots, never to be modified once written (only deleted).
Objects in s3 are files, each containing a set of pages for some branch/relation/segment as of a specific time (LSN). A snapshot could be complete (meaning it has a copy of every page), or it could be incremental (containing only the pages that were modified since the previous snapshot). It’s expected that most snapshots are incremental to keep storage costs low.
It’s expected that the pageserver would upload new snapshot objects frequently, e.g. somewhere between 30 seconds and 15 minutes, depending on cost/performance balance.
No-longer needed snapshots can be “squashed”-- meaning snapshot N and snapshot N+1 can be read by some cloud agent software, which writes out a new object containing the combined set of pages (keeping only the newest version of each page) and then deletes the original snapshots.
A pageserver only needs to store the set of pages needed to satisfy operations in flight: if a snapshot is still being written, the pageserver needs to hold historical pages so that snapshot captures a consistent moment in time (similar to what is needed to satisfy a slow replica).
WAL records can be discarded once a snapshot has been stored to s3. (Unless we want to keep them longer as part of a point-in-time recovery feature.)
## Pageserver operation
To start a pageserver from a stored snapshot, the pageserver downloads a set of snapshots sufficient to start handling requests. We assume this includes the latest copy of every page, though it might be possible to start handling requests early, and retrieve pages for the first time only when needed.
To halt a pageserver, one final snapshot should be written containing all pending WAL updates; then the pageserver and safekeepers can shut down.
It’s assumed there is some cloud management service that ensures only one pageserver is active and servicing writes to a given branch.
The pageserver needs to be able to track whether a given page has been modified since the last snapshot, and should be able to produce the set of dirty pages efficiently to create a new snapshot.
The pageserver need only store pages that are “reachable” from a particular LSN. For example, a page may be written four times, at LSN 100, 200, 300, and 400. If no snapshot is being created when LSN 200 is written, the page at LSN 100 can be discarded. If a snapshot is triggered when the pageserver is at LSN 299, the pageserver must preserve the page from LSN 200 until that snapshot is complete. As before, the page at LSN 300 can be discarded when the LSN 400 pages is written (regardless of whether the LSN 200 snapshot has completed.)
If the pageserver is servicing multiple branches, those branches may contain common history. While it would be possible to serve branches with zero knowledge of their common history, a pageserver could save a lot of space using an awareness of branch history to share the common set of pages. Computing the “liveness” of a historical page may be tricky in the face of multiple branches.
The pageserver may store dirty pages to memory or to local block storage; any local block storage format is only temporary “overflow” storage, and is not expected to be readable by future software versions.
The pageserver may store clean pages (those that are captured in a snapshot) any way it likes: in memory, in a local filesystem (possibly keeping a local copy of the snapshot file), or using some custom storage format. Reading pages from s3 would be functional, but is expected to be prohibitively slow.
The mechanism for recovery after a pageserver failure is WAL redo. If we find that too slow in some situations (e.g. write-heavy workload causes long startup), we can write more frequent snapshots to keep the number of outstanding WAL records low. If that’s still not good enough, we could look at other options (e.g. redundant pageserver or an EBS page journal).
A read-only pageserver is possible; such a pageserver could be a read-only cache of a specific snapshot, or could auto-update to the latest snapshot on some branch. Either way, no safekeeper is required. Multiple read-only pageservers could exist for a single branch or snapshot.
## Cloud snapshot manager operation
Cloud software may wish to do the following operations (commanded by a user, or based on some pre-programmed policy or other cloud agent):
Create/delete/clone/rename a database
Create a new branch (possibly from a historical snapshot)
Start/stop the pageserver/safekeeper on a branch
List databases/branches/snapshots that are visible to this user account
Some metadata operations (e.g. list branches/snapshots of a particular db) could be performed by scanning the contents of a bucket and inspecting the file headers of each snapshot object. This might not be fast enough; it might be necessary to build a metadata service that can respond more quickly to some queries.
This is especially true if there are public databases: there may be many thousands of buckets that are public, and scanning all of them is not a practical strategy for answering metadata queries.
## Snapshot names, deletion and concurrency
There may be race conditions between operations-- in particular, a “squash” operation may replace two snapshot objects (A, B) with some combined object (C). Since C is logically equivalent to B, anything that attempts to access B should be able to seamlessly switch over to C. It’s assumed that concurrent delete won’t disrupt a read in flight, but it may be possible for some process to read B’s header, and then discover on the next operation that B is gone.
For this reason, any attempted read should attempt a fallback procedure (list objects; search list for an equivalent object) if an attempted read fails. This requires a predictable naming scheme, e.g. `XXXX_YYYY_ZZZZ_DDDD`, where `XXXX` is the branch unique id, and `YYYY` and `ZZZZ` are the starting/ending LSN values. `DDDD` is a timestamp indicating when the object was created; this is used to disambiguate a series of empty snapshots, or to help a snapshot policy engine understand which snapshots should be kept or discarded.
## Branching
A user may request a new branch from the cloud user interface. There is a sequence of things that needs to happen:
- If the branch is supposed to be based on the latest contents, the pageserver should perform an immediate snapshot. This is the parent snapshot for the new branch.
- Cloud software should create the new branch, by generating a new (random) unique branch identifier, and creating a placeholder snapshot object.
- The placeholder object is an empty snapshot containing only metadata (which anchors it to the right parent history) and no pages.
- The placeholder can be discarded when the first snapshot (containing data) is completed. Discarding is equivalent to squashing, when the snapshot contains no data.
- If the branch needs to be started immediately, a pageserver should be notified that it needs to start servicing the branch. This may not be the same pageserver that services the parent branch, though the common history may make it the best choice.
Some of these steps could be combined into the pageserver, but that process would not be possible under all cases (e.g. if no pageserver is currently running, or if the branch is based on an older snapshot, or if a different pageserver will be serving the new branch). Regardless of which software drives the process, the result should look the same.
## Long-term file format
Snapshot files (and any other object stored in s3) must be readable by future software versions.
It should be possible to build multiple tools (in addition to the pageserver) that can read and write this file format-- for example, to allow cloud snapshot management.
Files should contain the following metadata, in addition to the set of pages:
- The version of the file format.
- A unique identifier for this branch (should be worldwide-unique and unchanging).
- Optionally, any human-readable names assigned to this branch (for management UI/debugging/logging).
- For incremental snapshots, the identifier of the predecessor snapshot. For new branches, this will be the parent snapshot (the point at which history diverges).
- The location of the predecessor branch snapshot, if different from this branch’s location.
- The LSN range `(parent, latest]` for this snapshot. For complete snapshots, the parent LSN can be 0.
- The UTC timestamp of the snapshot creation (which may be different from the time of its highest LSN, if the database is idle).
- A SHA2 checksum over the entire file (excluding the checksum itself), to preserve file integrity.
A file may contain no pages, and an empty LSN range (probably `(latest, latest]`?), which serves as a placeholder for either a newly-created branch, or a snapshot of an idle database.
Any human-readable names stored in the file may fall out of date if database/branch renames are allowed; there may need to be a cloud metadata service to query (current name -> unique identifier). We may choose instead to not store human-readable names in the database, or treat them as debugging information only.
## S3 semantics, and other kinds of storage
For development and testing, it may be easier to use other kinds of storage in place of s3. For example, a directory full of files can substitute for an s3 bucket with multiple objects. This mode is expected to match the s3 semantics (e.g. don’t edit existing files or use symlinks). Unit tests may omit files entirely and use an in-memory mock bucket.
Some users may want to use a local or network filesystem in place of s3. This isn’t prohibited but it’s not a priority, either.
Alternate implementations of s3 should be supported, including Google Cloud Storage.
Azure Blob Storage should be supported. We assume (without evidence) that it’s semantically equivalent to s3 for this purpose.
The properties of s3 that we depend on are:
list objects
streaming read of entire object
read byte range from object
streaming write new object (may use multipart upload for better relialibity)
delete object (that should not disrupt an already-started read).
Uploaded files, restored backups, or s3 buckets controlled by users could contain malicious content. We should always validate that objects contain the content they’re supposed to. Incorrect, Corrupt or malicious-looking contents should cause software (cloud tools, pageserver) to fail gracefully.
## Notes
Possible simplifications, for a first draft implementation:
- Assume that dirty pages fit in pageserver RAM. Can use kernel virtual memory to page out to disk if needed. Can improve this later.
- Don’t worry about the details of the squashing process yet.
- Don’t implement cloud metadata service; try to make everything work using basic s3 list-objects and reads.
- Don’t implement rename, delete at first.
- Don’t implement public/private, just use s3 permissions.
- Don’t worry about sharing history yet-- each user has their own bucket and a full copy of all data.
- Don’t worry about history that spans multiple buckets.
- Don’t worry about s3 regions.
- Don’t support user-writeable s3 buckets; users get only read-only access at most.
Open questions:
- How important is point-in-time recovery? When should we add this? How should it work?
- Should snapshot files use compression?
- Should we use snapshots for async replication? A spare pageserver could stay mostly warmed up by consuming snapshots as they’re created.
- Should manual snapshots, or snapshots triggered by branch creation, be named differently from snapshots that are triggered by a snapshot policy?
- When a new branch is created, should it always be served by the same pageserver that owns its parent branch? When should we start a new pageserver?
- How can pageserver software upgrade be done with minimal downtime?
Here I tried to describe the current state of thinking about our storage subsystem as I understand it. Feel free to correct me. Also, I tried to address items from Heikki's TODO and be specific on some of the details.
## Overview

### MemStore
MemStore holds the data between `latest_snapshot_lsn` and `latest_lsn`. It consists of PageIndex that holds references to WAL records or pages, PageStore that stores recently materialized pages, and WalStore that stores recently received WAL.
### PageIndex
PageIndex is an ordered collection that maps `(BufferTag, LSN)` to one of the following references (by reference I mean some information that is needed to access that data, e.g. file_id and offset):
* PageStoreRef -- page offset in the PageStore
* LocalStoreRef -- snapshot_id and page offset inside of that snapshot
* WalStoreRef -- offset (and size optionally) of WalRecord in WalStore
PageIndex holds information about all the pages in all incremental snapshots and in the latest full snapshot. If we aren't using page compression inside snapshots we actually can avoid storing references to the full snapshot and calculate page offsets based on relation sizes metadata in the full snapshot (assuming that full snapshot stores pages sorted by page number). However, I would suggest embracing page compression from the beginning and treat all pages as variable-sized.
We assume that PageIndex is few orders of magnitude smaller than addressed data hence it should fit memory. We also don't care about crash tolerance as we can rebuild it from snapshots metadata and WAL records from WalStore or/and Safekeeper.
### WalStore
WalStore is a queue of recent WalRecords. I imagine that we can store recent WAL the same way as Postgres does -- as 16MB files on disk. On top of that, we can add some fixed-size cache that would keep some amount of segments in memory.
For now, we may rely on the Safekeeper to safely store that recent WAL. But generally, I think we can pack all S3 operations into the page server so that it would be also responsible for the recent WAL pushdown to S3 (and Safekeeper may just delete WAL that was confirmed as S3-durable by the page server).
### PageStore
PageStore is storage for recently materialized pages (or in other words cache of getPage results). It is also can be implemented as a file-based queue with some memory cache on top of it.
There are few possible options for PageStore:
a) we just add all recently materialized pages there (so several versions of the same page can be stored there) -- that is more or less how it happens now with the current RocksDB implementation.
b) overwrite older pages with the newer pages -- if there is no replica we probably don't need older pages. During page overwrite, we would also need to change PageStoreRef back to WalStoreRef in PageIndex.
I imagine that newly created pages would just be added to the back of PageStore (again in queue-like fashion) and this way there wouldn't be any meaningful ordering inside of that queue. When we are forming a new incremental snapshot we may prohibit any updates to the current set of pages in PageStore (giving up on single page version rule) and cut off that whole set when snapshot creation is complete.
With option b) we can also treat PageStor as an uncompleted increamental snapshot.
### LocalStore
LocalStore keeps the latest full snapshot and set of incremental snapshots on top of it. We add new snapshots when the number of changed pages grows bigger than a certain threshold.
## Granularity
By granularity, I mean a set of pages that goes into a certain full snapshot. Following things should be taken into account:
* can we shard big databases between page servers?
* how much time will we spend applying WAL to access certain pages with older LSN's?
* how many files do we create for a single database?
I can think of the following options here:
1. whole database goes to one full snapshot.
* +: we never create a lot of files for one database
* +: the approach is quite straightforward, moving data around is simple
* -: can not be sharded
* -: long recovery -- we always need to recover the whole database
2. table segment is the unit of snapshotting
* +: straightforward for sharding
* +: individual segment can be quickly recovered with sliced WAL
* -: full snapshot can be really small (e.g. when the corresponding segment consists of a single page) and we can blow amount of files. Then we would spend eternity in directory scans and the amount of metadata for sharding can be also quite big.
3. range-partitioned snapshots -- snapshot includes all pages between [BuffTagLo, BuffTagHi] mixing different relations, databases, and potentially clusters (albeit from one tenant only). When full snapshot outgrows a certain limit (could be also a few gigabytes) we split the snapshot in two during the next full snapshot write. That approach would also require pages sorted by BuffTag inside our snapshots.
* +: addresses all mentioned issues
* -: harder to implement
I think it is okay to start with table segments granularity and just check how we will perform in cases of lots of small tables and check is there any way besides c) to deal with it.
Both PageStore and WalStore should be "sharded" by this granularity level.
## Security
We can generate different IAM keys for each tenant and potentially share them with users (in read-only mode?) or even allow users to provide their S3 buckets credentials.
Also, S3 backups are usually encrypted by per-tenant privates keys. I'm not sure in what threat model such encryption would improve something (taking into account per-tenant IAM keys), but it seems that everybody is doing that (both AMZN and YNDX). Most likely that comes as a requirement about "cold backups" by some certification procedure.
## Dynamics
### WAL stream handling
When a new WAL record is received we need to parse BufferTags in that record and insert them in PageIndex with WalStoreRef as a value.
### getPage queries
Look up the page in PageIndex. If the value is a page reference then just respond with that page. If the referenced value is WAL record then find the most recent page with the same BuffTag (that is why we need ordering in PageIndex); recover it by applying WAL records; save it in PageStore; respond with that page.
### Starting page server without local data
* build set of latest full snapshots and incremental snapshots on top of them
* load all their metadata into PageIndex
* Safekeeper should connect soon and we can ask for a WAL stream starting from the latest incremental snapshot
* for databases that are connected to us through the Safekeeper we can start loading the set of the latest snapshots or we can do that lazily based on getPage request (I'd better avoid doing that lazily for now without some access stats from the previous run and just transfer all data for active database from S3 to LocalStore).
### Starting page server with local data (aka restart or reboot)
* check that local snapshot files are consistent with S3
### Snapshot creation
Track size of future snapshots based on info in MemStore and when it exceeds some threshold (taking into account our granularity level) create a new incremental snapshot. Always emit incremental snapshots from MemStore.
To create a new snapshot we need to walk through WalStore to get the list of all changed pages, sort it, and get the latest versions of that pages from PageStore or by WAL replay. It makes sense to maintain that set in memory while we are receiving the WAL stream to avoid parsing WAL during snapshot creation.
Full snapshot creation can be done by GC (or we can call that entity differently -- e.g. merger?) by merging the previous full snapshot with several incremental snapshots.
### S3 pushdown
When we have several full snapshots GC can push the old one with its increments to S3.
### Branch creation
Create a new timeline and replay sliced WAL up to a requested point. When the page is not in PageIndex ask the parent timeline about a page. Relation sizes are tricky.
## File formats
As far as I understand Bookfile/Aversion addresses versioning and serialization parts.
As for exact data that should go to snapshots I think it is the following for each snapshot:
* format version number
* set of key/values to interpret content (e.g. is page compression enabled, is that a full or incremental snapshot, previous snapshot id, is there WAL at the end on file, etc) -- it is up to a reader to decide what to do if some keys are missing or some unknow key are present. If we add something backward compatible to the file we can keep the version number.
* array of [BuffTag, corresponding offset in file] for pages -- IIUC that is analogous to ToC in Bookfile
* array of [(BuffTag, LSN), corresponding offset in file] for the WAL records
* pages, one by one
* WAL records, one by one
It is also important to be able to load metadata quickly since it would be one of the main factors impacting the time of page server start. E.g. if would store/cache about 10TB of data per page server, the size of uncompressed page references would be about 30GB (10TB / ( 8192 bytes page size / ( ~18 bytes per ObjectTag + 8 bytes offset in the file))).
1) Since our ToC/array of entries can be sorted by ObjectTag we can store the whole BufferTag only when realtion_id is changed and store only delta-encoded offsets for a given relation. That would reduce the average per-page metadata size to something less than 4 bytes instead of 26 (assuming that pages would follow the same order and offset delatas would be small).
2) It makes sense to keep ToC at the beginning of the file to avoid extra seeks to locate it. Doesn't matter too much with the local files but matters on S3 -- if we are accessing a lot of ~1Gb files with the size of metadata ~ 1Mb then the time to transfer this metadata would be comparable with access latency itself (which is about a half of a second). So by slurping metadata with one read of file header instead of N reads we can improve the speed of page server start by this N factor.
I think both of that optimizations can be done later, but that is something to keep in mind when we are designing our storage serialization routines.
Also, there were some discussions about how to embed WAL in incremental snapshots. So far following ideas were mentioned:
1. snapshot lsn=200, includes WAL in range 200-300
2. snapshot lsn=200, includes WAL in range 100-200
3. data snapshots are separated from WAL snapshots
Both options 2 and 3 look good. I'm inclined towards option 3 as it would allow us to apply different S3 pushdown strategies for data and WAL files (e.g. we may keep data snapshot until the next full snapshot, but we may push WAL snapshot to S3 just when they appeared if there are no replicas).
Extracted from this [PR](https://github.com/zenithdb/rfcs/pull/13)
## Motivation
In some situations, safekeeper (SK) needs coordination with other SK's that serve the same tenant:
1. WAL deletion. SK needs to know what WAL was already safely replicated to delete it. Now we keep WAL indefinitely.
2. Deciding on who is sending WAL to the pageserver. Now sending SK crash may lead to a livelock where nobody sends WAL to the pageserver.
3. To enable SK to SK direct recovery without involving the compute
## Summary
Compute node has connection strings to each safekeeper. During each compute->safekeeper connection establishment, the compute node should pass down all that connection strings to each safekeeper. With that info, safekeepers may establish Postgres connections to each other and periodically send ping messages with LSN payload.
## Components
safekeeper, compute, compute<->safekeeper protocol, possibly console (group SK addresses)
## Proposed implementation
Each safekeeper can periodically ping all its peers and share connectivity and liveness info. If the ping was not receiver for, let's say, four ping periods, we may consider sending safekeeper as dead. That would mean some of the alive safekeepers should connect to the pageserver. One way to decide which one exactly: `make_connection = my_node_id == min(alive_nodes)`
Since safekeepers are multi-tenant, we may establish either per-tenant physical connections or per-safekeeper ones. So it makes sense to group "logical" connections between corresponding tenants on different nodes into a single physical connection. That means that we should implement an interconnect thread that maintains physical connections and periodically broadcasts info about all tenants.
Right now console may assign any 3 SK addresses to a given compute node. That may lead to a high number of gossip connections between SK's. Instead, we can assign safekeeper triples to the compute node. But if we want to "break"/" change" group by an ad-hoc action, we can do it.
### Corner cases
- Current safekeeper may be alive but may not have connectivity to the pageserver
To address that, we need to gossip visibility info. Based on that info, we may define SK as alive only when it can connect to the pageserver.
- Current safekeeper may be alive but may not have connectivity with the compute node.
We may broadcast last_received_lsn and presence of compute connection and decide who is alive based on that.
- It is tricky to decide when to shut down gossip connections because we need to be sure that pageserver got all the committed (in the distributed sense, so local SK info is not enough) records, and it may never lose them. It is not a strict requirement since `--sync-safekeepers` that happen before the compute start will allow the pageserver to consume missing WAL, but it is better to do that in the background. So the condition may look like that: `majority_max(flush_lsn) == pageserver_s3_lsn` Here we rely on the two facts:
- that `--sync-safekeepers` happened after the compute shutdown, and it advanced local commit_lsn's allowing pageserver to consume that WAL.
- we wait for the `pageserver_s3_lsn` advancement to avoid pageserver's last_received_lsn/disk_consistent_lsn going backward due to the disk/hardware failure and subsequent S3 recovery
If those conditions are not met, we will have some gossip activity (but that may be okay).
## Pros/cons
Pros:
- distributed, does not introduce new services (like etcd), does not add console as a storage dependency
- lays the foundation for gossip-based recovery
Cons:
- Only compute knows a set of safekeepers, but they should communicate even without compute node. In case of safekeepers restart, we will lose that info and can't gossip anymore. Hence we can't trim some WAL tail until the compute node start. Also, it is ugly.
- If the console assigns a random set of safekeepers to each Postgres, we may end up in a situation where each safekeeper needs to have a connection with all other safekeepers. We can group safekeepers into isolated triples in the console to avoid that. Then "mixing" would happen only if we do rebalancing.
## Alternative implementation
We can have a selected node (e.g., console) with everybody reporting to it.
## Security implications
We don't increase the attack surface here. Communication can happen in a private network that is not exposed to users.
## Scalability implications
The only thing that may grow as we grow the number of computes is the number of gossip connections. But if we group safekeepers and assign a compute node to the random SK triple, the number of connections would be constant.
Initially created [here](https://github.com/zenithdb/rfcs/pull/16) by @kelvich.
That it is an alternative to (014-safekeeper-gossip)[]
## Motivation
As in 014-safekeeper-gossip we need to solve the following problems:
* Trim WAL on safekeepers
* Decide on which SK should push WAL to the S3
* Decide on which SK should forward WAL to the pageserver
* Decide on when to shut down SK<->pageserver connection
This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery but rather unties two sets of different issues we previously wanted to solve with gossip.
Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper.
## Summary
Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a `--broker-url=1.2.3.4` CLI param.
Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a Grpc client in our codebase either directly or as an etcd dependency.
## Non-goals
That RFC does *not* suggest moving the compute to pageserver and compute to safekeeper mappings out of the console. The console is still the only place in the cluster responsible for the persistency of that info. So I'm implying that each pageserver and safekeeper exactly knows what timelines he serves, as it currently is. We need some mechanism for a new pageserver to discover mapping info, but that is out of the scope of this RFC.
## Impacted components
pageserver, safekeeper
adds either etcd or console as a storage dependency
## Possible implementation: custom message broker in the console
We've decided to go with an etcd approach instead of the message broker.
<detailsclosed>
<summary>Original suggestion</summary>
<br>
We can add a Grpc service in the console that acts as a message broker since the console knows the addresses of all the components. The broker can ignore the payload and only redirect messages. So, for example, each safekeeper may send a message to the peering safekeepers or to the pageserver responsible for a given timeline.
Message format could be `{sender, destination, payload}`.
The destination is either:
1.`sk_#{tenant}_#{timeline}` -- to be broadcasted on all safekeepers, responsible for that timeline, or
2.`pserver_#{tenant}_#{timeline}` -- to be broadcasted on all pageservers, responsible for that timeline
Sender is either:
1.`sk_#{sk_id}`, or
2.`pserver_#{pserver_id}`
I can think of the following behavior to address our original problems:
* WAL trimming
Each safekeeper periodically broadcasts `(write_lsn, commit_lsn)` to all peering (peering == responsible for that timeline) safekeepers
* Decide on which SK should push WAL to the S3
Each safekeeper periodically broadcasts `i_am_alive_#{current_timestamp}` message to all peering safekeepers. That way, safekeepers may maintain the vector of alive peers (loose one, with false negatives). Alive safekeeper with the minimal id pushes data to S3.
* Decide on which SK should forward WAL to the pageserver
Each safekeeper periodically sends (write_lsn, commit_lsn, compute_connected) to the relevant pageservers. With that info, pageserver can maintain a view of the safekeepers state, connect to a random one, and detect the moments (e.g., one the safekeepers is not making progress or down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `#sk_#{sk_id}` to `4.5.6.7:6400`.
Pageserver connection to the safekeeper triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other addresses until the next compute connection).
* Decide on when to shutdown sk<->pageserver connection
Again, pageserver would have all the info to understand when to shut down the safekeeper connection.
### Scalability
One node is enough (c) No, seriously, it is enough.
### High Availability
Broker lives in the console, so we can rely on k8s maintaining the console app alive.
If the console is down, we won't trim WAL and reconnect the pageserver to another safekeeper. But, at the same, if the console is down, we already can't accept new compute connections and start stopped computes, so we are making things a bit worse, but not dramatically.
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | Console broker | ...
sk_n <-> |________________| <-> pserver_m
```
</details>
## Implementation: etcd state store
Alternatively, we can set up `etcd` and maintain the following data structure in it:
```ruby
"compute_#{tenant}_#{timeline}"=>{
safekeepers=>{
"sk_#{sk_id}"=>{
write_lsn:"0/AEDF130",
commit_lsn:"0/AEDF100",
compute_connected:true,
last_updated:1642621138,
},
}
}
```
As etcd doesn't support field updates in the nested objects that translates to the following set of keys:
Each storage node can subscribe to the relevant sets of keys and maintain a local view of that structure. So in terms of the data flow, everything is the same as in the previous approach. Still, we can avoid implementing the message broker and prevent runtime storage dependency on a console.
### Safekeeper address discovery
During the startup safekeeper should publish the address he is listening on as the part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. Safekeeper should have `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertize something more useful.
### Safekeeper behavior
For each timeline safekeeper periodically broadcasts `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way safekeeper will have an information about peering safekeepers.
That amount of information is enough to properly trim WAL. To decide on who is pushing the data to S3 safekeeper may use etcd leases or broadcast a timestamp and hence track who is alive.
### Pageserver behavior
Pageserver subscribes to `compute_#{tenant}_#{timeline}` for each tenant it owns. With that info, pageserver can maintain a view of the safekeepers state, connect to a random one, and detect the moments (e.g., one the safekeepers is not making progress or down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `#sk_#{sk_id}` to `4.5.6.7:6400`.
Pageserver connection to the safekeeper can be triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
As an alternative to compute_connected, we can track timestamp of the latest message arrived to safekeeper from compute. Usually compute broadcasts KeepAlive to all safekeepers every second, so it'll be updated every second when connection is ok. Then the connection can be considered down when this timestamp isn't updated for a several seconds.
This will help to faster detect issues with safekeeper (and switch to another) in the following cases:
when compute failed but TCP connection stays alive until timeout (usually about a minute)
when safekeeper failed and didn't set compute_connected to false
Another way to deal with [2] is to process (write_lsn, commit_lsn, compute_connected) as a KeepAlive on the pageserver side and detect issues when sk_id don't send anything for some time. This way is fully compliant to this RFC.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other addresses until the next compute connection).
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | etcd | ...
sk_n <-> |________________| <-> pserver_m
```
### Sequence diagrams for different workflows
#### Cluster startup
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
PS1->>M: subscribe to updates to state of timeline N
C->>+SK1: WAL push
loop constantly update current lsns
SK1->>-M: I'm at lsn A
end
C->>+SK2: WAL push
loop constantly update current lsns
SK2->>-M: I'm at lsn B
end
C->>+SK3: WAL push
loop constantly update current lsns
SK3->>-M: I'm at lsn C
end
loop request pages
C->>+PS1: get_page@lsn
PS1->>-C: page image
end
M->>PS1: New compute appeared for timeline N. SK1 at A, SK2 at B, SK3 at C
note over PS1: Say SK1 at A=200, SK2 at B=150 SK3 at C=100 <br> so connect to SK1 because it is the most up to date one
PS1->>SK1: start replication
```
#### Behavour of services during typical operations
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Scenario 1: Pageserver checkpoint
note over PS1: Upload data to S3
PS1->>M: Update remote consistent lsn
M->>SK1: propagate remote consistent lsn update
note over SK1: truncate WAL up to remote consistent lsn
M->>SK2: propagate remote consistent lsn update
note over SK2: truncate WAL up to remote consistent lsn
M->>SK3: propagate remote consistent lsn update
note over SK3: truncate WAL up to remote consistent lsn
SK1->>SK2: Fetch WAL delta between 100 (SK1) and 200 (SK2)
note over C,M: Scenario 3: PS1 detects that SK1 is lagging behind: Connection from SK1 is broken or there is no messages from it in 30 seconds.
note over PS1: e.g. SK2 is at 150, SK3 is at 100, chose SK2 as a new replication source
PS1->>SK2: start replication
```
#### Behaviour during timeline relocation
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Timeline is being relocated from PS1 to PS2
O->>+PS2: Attach timeline
PS2->>-O: 202 Accepted if timeline exists in S3
note over PS2: Download timeline from S3
note over O: Poll for timeline download (or subscribe to metadata service)
loop wait for attach to complete
O->>PS2: timeline detail should answer that timeline is ready
end
PS2->>M: Register downloaded timeline
PS2->>M: Get safekeepers for timeline, subscribe to changes
PS2->>SK1: Start replication to catch up
note over O: PS2 catched up, time to switch compute
O->>C: Restart compute with new pageserver url in config
note over C: Wal push is restarted
loop request pages
C->>+PS2: get_page@lsn
PS2->>-C: page image
end
O->>PS1: detach timeline
note over C,M: Scenario 1: Attach call failed
O--xPS2: Attach timeline
note over O: The operation can be safely retried, <br> if we hit some threshold we can try another pageserver
note over C,M: Scenario 2: Attach succeeded but pageserver failed to download the data or start replication
loop wait for attach to complete
O--xPS2: timeline detail should answer that timeline is ready
end
note over O: Can wait for a timeout, and then try another pageserver <br> there should be a limit on number of different pageservers to try
note over C,M: Scenario 3: Detach fails
O--xPS1: Detach timeline
note over O: can be retried, if continues to fail might lead to data duplication in s3
```
# Pros/cons
## Console broker/etcd vs gossip:
Gossip pros:
* gossip allows running storage without the console or etcd
Console broker/etcd pros:
* simpler
* solves "call me maybe" as well
* avoid possible N-to-N connection issues with gossip without grouping safekeepers in pre-defined triples
## Console broker vs. etcd:
Initially, I wanted to avoid etcd as a dependency mostly because I've seen how painful for Clickhouse was their ZooKeeper dependency: in each chat, at each conference, people were complaining about configuration and maintenance barriers with ZooKeeper. It was that bad that ClickHouse re-implemented ZooKeeper to embed it: https://clickhouse.com/docs/en/operations/clickhouse-keeper/.
But with an etcd we are in a bit different situation:
1. We don't need persistency and strong consistency guarantees for the data we store in the etcd
2. etcd uses Grpc as a protocol, and messages are pretty simple
So it looks like implementing in-mem store with etcd interface is straightforward thing _if we will want that in future_. At the same time, we can avoid implementing it right now, and we will be able to run local zenith installation with etcd running somewhere in the background (as opposed to building and running console, which in turn requires Postgres).
- Google fuchsia: [https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/TEMPLATE](https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/TEMPLATE)
- Should be submitted in a pull request with and full RFC text in a commited markdown file and copy of the Summary and Motivation sections also included in the PR body.
- RFC should be published for review before most of the actual code is written. This isn’t a strict rule, don’t hesitate to experiment and build a POC in parallel with writing an RFC.
- Add labels to the PR in the same manner as you do Issues. Example TBD
- Request the review from your peers. Reviewing the RFCs from your peers is a priority, same as reviewing the actual code.
- The Tech Design RFC should evolve based on the feedback received and further during the development phase if problems are discovered with the taken approach
- RFCs stop evolving once the consensus is found or the proposal is implemented and merged.
- RFCs are not intended as a documentation that’s kept up to date **after** the implementation is finished. Do not update the Tech Design RFC when merged functionality evolves later on. In such situation a new RFC may be appropriate.
### RFC template
Note, a lot of the sections are marked as ‘if relevant’. They are included into the template as a reminder and to help inspiration.
One of the resource consumption limits for free-tier users is a cluster size limit.
To enforce it, we need to calculate the timeline size and check if the limit is reached before relation create/extend operations.
If the limit is reached, the query must fail with some meaningful error/warning.
We may want to exempt some operations from the quota to allow users free space to fit back into the limit.
The stateless compute node that performs validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components.
## Motivation
Limit the maximum size of a PostgreSQL instance to limit free tier users (and other tiers in the future).
First of all, this is needed to control our free tier production costs.
Another reason to limit resources is risk management — we haven't (fully) tested and optimized zenith for big clusters,
so we don't want to give users access to the functionality that we don't think is ready.
## Components
* pageserver - calculate the size consumed by a timeline and add it to the feedback message.
* safekeeper - pass feedback message from pageserver to compute.
* compute - receive feedback message, enforce size limit based on GUC `zenith.max_cluster_size`.
* console - set and update `zenith.max_cluster_size` setting
## Proposed implementation
First of all, it's necessary to define timeline size.
The current approach is to count all data, including SLRUs. (not including WAL)
Here we think of it as a physical disk underneath the Postgres cluster.
This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver.
Alternatively, we could count only relation data. As in pg_database_size().
This approach is somewhat more user-friendly because it is the data that is really affected by the user.
On the other hand, it puts us in a weaker position than other services, i.e., RDS.
We will need to refactor the timeline_size counter or add another counter to implement it.
Timeline size is updated during wal digestion. It is not versioned and is valid at the last_received_lsn moment.
Then this size should be reported to compute node.
`current_timeline_size` value is included in the walreceiver's custom feedback message: `ZenithFeedback.`
(PR about protocol changes https://github.com/zenithdb/zenith/pull/1037).
This message is received by the safekeeper and propagated to compute node as a part of `AppendResponse`.
Finally, when compute node receives the `current_timeline_size` from safekeeper (or from pageserver directly), it updates the global variable.
And then every zenith_extend() operation checks if limit is reached `(current_timeline_size > zenith.max_cluster_size)` and throws `ERRCODE_DISK_FULL` error if so.
(see Postgres error codes [https://www.postgresql.org/docs/devel/errcodes-appendix.html](https://www.postgresql.org/docs/devel/errcodes-appendix.html))
TODO:
We can allow autovacuum processes to bypass this check, simply checking `IsAutoVacuumWorkerProcess()`.
It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check, but it's uneasy to distinguish these operations at the low level.
See issues https://github.com/neondatabase/neon/issues/1245
https://github.com/zenithdb/zenith/issues/1445
TODO:
We should warn users if the limit is soon to be reached.
### **Reliability, failure modes and corner cases**
1.`current_timeline_size` is valid at the last received and digested by pageserver lsn.
If pageserver lags behind compute node, `current_timeline_size` will lag too. This lag can be tuned using backpressure, but it is not expected to be 0 all the time.
So transactions that happen in this lsn range may cause limit overflow. Especially operations that generate (i.e., CREATE DATABASE) or free (i.e., TRUNCATE) a lot of data pages while generating a small amount of WAL. Are there other operations like this?
Currently, CREATE DATABASE operations are restricted in the console. So this is not an issue.
### **Security implications**
We treat compute as an untrusted component. That's why we try to isolate it with secure container runtime or a VM.
Malicious users may change the `zenith.max_cluster_size`, so we need an extra size limit check.
To cover this case, we also monitor the compute node size in the console.
Pageserver is mainly configured via a `pageserver.toml` config file.
If there's no such file during `init` phase of the server, it creates the file itself. Without 'init', the file is read.
There's a possibility to pass an arbitrary config value to the pageserver binary as an argument: such values override
the values in the config file, if any are specified for the same key and get into the final config during init phase.
### Config example
```toml
# Initial configuration file created by 'pageserver --init'
listen_pg_addr='127.0.0.1:64000'
listen_http_addr='127.0.0.1:9898'
checkpoint_distance='268435456'# in bytes
checkpoint_period='1 s'
gc_period='100 s'
gc_horizon='67108864'
max_file_descriptors='100'
# initial superuser role name to use when creating a new tenant
initial_superuser_name='zenith_admin'
# [remote_storage]
```
The config above shows default values for all basic pageserver settings.
Pageserver uses default values for all files that are missing in the config, so it's not a hard error to leave the config blank.
Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start.
Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in TOML specification and
- either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
- or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
### Config values
All values can be passed as an argument to the pageserver binary, using the `-c` parameter and specified as a valid TOML string. All tables should be passed in the inline form.
Note that TOML distinguishes between strings and integers, the former require single or double quotes around them.
#### checkpoint_distance
`checkpoint_distance` is the amount of incoming WAL that is held in
the open layer, before it's flushed to local disk. It puts an upper
bound on how much WAL needs to be re-processed after a pageserver
crash. It is a soft limit, the pageserver can momentarily go above it,
but it will trigger a checkpoint operation to get it back below the
limit.
`checkpoint_distance` also determines how much WAL needs to be kept
durable in the safekeeper. The safekeeper must have capacity to hold
this much WAL, with some headroom, otherwise you can get stuck in a
situation where the safekeeper is full and stops accepting new WAL,
but the pageserver is not flushing out and releasing the space in the
safekeeper because it hasn't reached checkpoint_distance yet.
`checkpoint_distance` also controls how often the WAL is uploaded to
S3.
The unit is # of bytes.
#### compaction_period
Every `compaction_period` seconds, the page server checks if
maintenance operations, like compaction, are needed on the layer
files. Default is 1 s, which should be fine.
#### compaction_target_size
File sizes for L0 delta and L1 image layers. Default is 128MB.
#### gc_horizon
`gz_horizon` determines how much history is retained, to allow
branching and read replicas at an older point in time. The unit is #
of bytes of WAL. Page versions older than this are garbage collected
away.
#### gc_period
Interval at which garbage collection is triggered. Default is 100 s.
#### image_creation_threshold
L0 delta layer threshold for L1 iamge layer creation. Default is 3.
#### pitr_interval
WAL retention duration for PITR branching. Default is 30 days.
#### initial_superuser_name
Name of the initial superuser role, passed to initdb when a new tenant
is initialized. It doesn't affect anything after initialization. The
default is Note: The default is 'zenith_admin', and the console
depends on that, so if you change it, bad things will happen.
#### page_cache_size
Size of the page cache, to hold materialized page versions. Unit is
number of 8 kB blocks. The default is 8192, which means 64 MB.
#### max_file_descriptors
Max number of file descriptors to hold open concurrently for accessing
layer files. This should be kept well below the process/container/OS
limit (see `ulimit -n`), as the pageserver also needs file descriptors
for other files and for sockets for incoming connections.
#### pg_distrib_dir
A directory with Postgres installation to use during pageserver activities.
Inside that dir, a `bin/postgres` binary should be present.
The default distrib dir is `./tmp_install/`.
#### workdir (-D)
A directory in the file system, where pageserver will store its files.
The default is `./.zenith/`.
This parameter has a special CLI alias (`-D`) and can not be overridden with regular `-c` way.
##### Remote storage
There's a way to automatically back up and restore some of the pageserver's data from working dir to the remote storage.
The backup system is disabled by default and can be enabled for either of the currently available storages:
###### Local FS storage
Pageserver can back up and restore some of its workdir contents to another directory.
For that, only a path to that directory needs to be specified as a parameter:
```toml
[remote_storage]
local_path='/some/local/path/'
```
###### S3 storage
Pageserver can back up and restore some of its workdir contents to S3.
Full set of S3 credentials is needed for that as parameters.
Configuration example:
```toml
[remote_storage]
# Name of the bucket to connect to
bucket_name='some-sample-bucket'
# Name of the region where the bucket is located at
bucket_region='eu-north-1'
# A "subfolder" in the bucket, to use the same bucket separately by multiple pageservers at once.
# Optional, pageserver uses entire bucket if the prefix is not specified.
prefix_in_bucket='/some/prefix/'
# S3 API query limit to avoid getting errors/throttling from AWS.
concurrency_limit=100
```
If no IAM bucket access is used during the remote storage usage, use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables to set the access credentials.
###### General remote storage configuration
Pagesever allows only one remote storage configured concurrently and errors if parameters from multiple different remote configurations are used.
No default values are used for the remote storage configuration parameters.
Besides, there are parameters common for all types of remote storage that can be configured, those have defaults:
```toml
[remote_storage]
# Max number of concurrent timeline synchronized (layers uploaded or downloaded) with the remote storage at the same time.
max_concurrent_syncs=50
# Max number of errors a single task can have before it's considered failed and not attempted to run anymore.
Below you will find a brief overview of each subdir in the source tree in alphabetical order.
`/control_plane`:
Local control plane.
Functions to start, configure and stop pageserver and postgres instances running as a local processes.
Intended to be used in integration tests and in CLI tools for local installations.
`/docs`:
Documentaion of the Zenith features and concepts.
Now it is mostly dev documentation.
`/monitoring`:
TODO
`/pageserver`:
Zenith storage service.
The pageserver has a few different duties:
- Store and manage the data.
- Generate a tarball with files needed to bootstrap ComputeNode.
- Respond to GetPage@LSN requests from the Compute Nodes.
- Receive WAL from the WAL service and decode it.
- Replay WAL that's applicable to the chunks that the Page Server maintains
For more detailed info, see [/pageserver/README](/pageserver/README.md)
`/proxy`:
Postgres protocol proxy/router.
This service listens psql port, can check auth via external service
and create new databases and accounts (control plane API in our case).
`/test_runner`:
Integration tests, written in Python using the `pytest` framework.
`/vendor/postgres`:
PostgreSQL source tree, with the modifications needed for Zenith.
`/vendor/postgres/contrib/zenith`:
PostgreSQL extension that implements storage manager API and network communications with remote page server.
`/vendor/postgres/contrib/zenith_test_utils`:
PostgreSQL extension that contains functions needed for testing and debugging.
`/safekeeper`:
The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see [/safekeeper/README](/safekeeper/README.md)
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation.
`/zenith`
Main entry point for the 'zenith' CLI utility.
TODO: Doesn't it belong to control_plane?
`/libs`:
Unites granular neon helper crates under the hood.
`/libs/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
`/libs/utils`:
Generic helpers that are shared between other crates in this repository.
A subject for future modularization.
`/libs/metrics`:
Helpers for exposing Prometheus metrics from the server.
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version) or greater.
- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected.
- If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
```
- Install `poetry`
- Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`.
- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI.
Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.
### Obligatory checks
We force code formatting via `yapf` and type hints via `mypy`.
Run the following commands in the repository's root (next to `setup.cfg`):
```bash
poetry run yapf -ri . # All code is reformatted
poetry run mypy . # Ensure there are no typing errors
```
**WARNING**: do not run `mypy` from a directory other than the root of the repository.
Otherwise it will not find its configuration.
Also consider:
* Running `flake8` (or a linter of your choice, e.g. `pycodestyle`) and fixing possible defects, if any.
* Adding more type hints to your code to avoid `Any`.
### Changing dependencies
To add new package or change an existing one you can use `poetry add` or `poetry update` or edit `pyproject.toml` manually. Do not forget to run `poetry lock` in the latter case.
More details are available in poetry's [documentation](https://python-poetry.org/docs/).
"CREATE TABLE t(key int primary key, value text)",
);
letcp=storage_cplane.clone();
letfailures_thread=thread::spawn(move||{
simulate_failures(cp);
});
letmutpsql=node.open_psql("postgres");
foriin1..=1000{
psql.execute("INSERT INTO t values ($1, 'payload')",&[&i])
.unwrap();
}
letcount: i64=node
.safe_psql("postgres","SELECT sum(key) FROM t")
.first()
.unwrap()
.get(0);
println!("sum = {}",count);
assert_eq!(count,500500);
storage_cplane.stop();
failures_thread.join().unwrap();
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.