This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for envrionment.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
wal_storage.rs was split up from timeline.rs, safekeeper.rs and send_wal.rs,
and now contains all WAL related code from the safekeeper. Now there are
PhysicalStorage for persisting WAL to disk and WalReader for reading it.
This allows optimizing PhysicalStorage without affecting too much of other
code.
Also there is a separate structure for persisting control file now in
control_file.rs.
Previous version of spec caused parsing errors in generated clients
as return type is object not array, also one field was missing. In
a passing set `format: hex` on ancestor_id too as value conforms to
that format.
This change makes most parts of the code asynchronous, except
for the `mgmt` subsystem (we're going to drop it anyway).
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
If a heap UPDATE record modified two pages, and both pages needed to have
their VM bits cleared, and the VM bits were located on the same VM page,
we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for
the same VM page. That produced warnings like this in the pageserver log:
Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists
To fix, change ClearVisibilityMapFlags so that it can update the bits
for both pages as one operation.
This was already covered by several python tests, so no need to add a
new one. Fixes#1125.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
It was printing a lot of stuff to the log with INFO level, for routine
things like receiving or sending messages. Reduce the noise. The amount
of logging was excessive, and it was also consuming a fair amount of CPU
(about 20% of safekeeper's CPU usage in a little test I ran).
* Always initialize flush_lsn/commit_lsn metrics on a specific timeline, no more `n/a`
* Update flush_lsn metrics missing from cba4da3f4d
* Ensure that flush_lsn found on load is >= than both commit_lsn and truncate_lsn
* Add some debug logging
Use GUC zenith.max_cluster_size to set the limit.
If limit is reached, extend requests will throw out-of-space error.
When current size is too close to the limit - throw a warning.
Add new test: test_timeline_size_quota.
Use log::error!() instead. I spotted a few of these "connection error"
lines in the logs, without timestamps and the other stuff we print for
all other log messages.
Timeline is active whenever there is at least 1 connection from compute or
pageserver is not caught up. Currently 'active' means callmemaybes are being
sent.
Fixes race: now suspend condition checking and callmemaybe unsubscribe happen
under the same lock.
Now it's possible to call Fe{Startup,}Message in both
sync and async contexts, which is good for proxy.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
* Fix checkpoint.nextXid update
* Add test for cehckpoint.nextXid
* Fix indentation of test_next_xid.py
* Fix mypy error in test_next_xid.py
* Tidy up the test case.
* Add a unit test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Freeze vectors at the same end LSN
* Fix calculation of last LSN for inmem layer
* Do not advance disk_consistent_lsn is no open layer was evicted
* Fix calculation of freeze_end_lsn
* Let start_lsn be larger than oldest_pending_lsn
* Rename 'oldest_pending_lsn' and 'last_lsn', add comments.
* Fix future_layerfiles test
* Update comments conserning olest_lsn
* Update comments conserning olest_lsn
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Now zenith_cli handles wal_acceptors config internally, and if we
will append wal_acceptors to postgresql.conf in python tests, then
it will contain duplicate wal_acceptors config.
Currently ztimelineids are unique, but all APIs accept the pair, so let's keep
it everywhere for uniformity.
Carry around ZTTId containing both ZTenantId and ZTimelineId for simplicity.
(existing clusters on staging ought to be preprocessed for that)
* Reproduce github issue #1047.
* Use RwLock to protect gc_cuttof_lsn
* Eeduce number of updates in test_gc_aggressive
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages
* Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC
* Remove random sleep between wait_for_lsn and get_page_at_lsn
* Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn
* Update test_prohibit_branch_creation_on_pre_initdb_lsn test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
This patch includes attach/detach http endpoints in pageservers. Some
changes in callmemaybe handling inside safekeeper and an integrational
test to check migration with and without load. There are still some
rough edges that will be addressed in follow up patches
Mainly because it has better support for installing the packages from
different python versions.
It also has better dependency resolver than Pipenv. And supports modern
standard for python dependency management. This includes usage of
pyproject.toml for project specific configuration instead of per
tool conf files. See following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/https://www.python.org/dev/peps/pep-0518/
* Do not delete layers beyand cutoff LSN
* Update pageserver/src/layered_repository/layer_map.rs
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036, previously the libpq
listener thread would not exit until one more connection arrives.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHanlde of each new thread to a vector but old handles
for threads that had already exited were never removed.
Log the error and continue. Hopefully it's a transient failure.
This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
'anyhow' crate can include a backtrace in all errors, when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.
The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.
In the passing, reorder some function arguments to be more consistent.
The walkeeper launch two threads for each connection, and uses a guard
object to remove entry from 'replicas' array, when finishes. But only
the background thread held onto the guard object, so if the background
thread finished before the other thread, the array entry would be
removed prematurely, which lead to panic in the check_stop_streaming()
call.
Fixes https://github.com/zenithdb/zenith/issues/1103
Hexalize zids there for better output; since Serde doesn't support several
formats for one struct, on-disk representation is changed as well, make
upgrade.rs cope with it.
to avoid a subtle race condition.
Without safekeeper, walreceiver reconnection can stuck,
because of IO deadlock between walsender auth and regular backend.
* Do not hold timelines lock during GC
refer #1087
* Add gc_cs mutex for preveting creation of new timelines during GC
* Make clippy happy
* Use Mutex<()> instead of Mutex<i32> for GC critical section
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.
Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.
Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
The first COPY generates about 230 MB of write I/O, but the second
COPY, after deleting most of the rows and vacuuming the rows away,
generates 370 MB of writes. Both COPYs insert the same amount of data,
so they should generate roughly the same amount of I/O. This commit
doesn't try to fix the issue, just adds a test case to demonstrate it.
Add a new 'checkpoint' command to the pageserver API. Previously,
we've used 'do_gc' for that, but many tests, including this new one,
really only want to perform a checkpoint and don't care about GC. For
now, I only used the command in the new test, though, and didn't
convert any existing tests to use it.
It creates busy loop if pageserver <-> safekeeper connection fails after it was
established (e.g. currently due to 'segment checkpoint not found' error on
pageserver).
Also wake up callmemaybe thread regularly once in recall_period regardless of
channel activity.
This patch allows to shutdown wal receiver when there are no messages
and wal receiver is blocked inside tokio-postgres. In this case it
cannot check the shutdown flag.
This patch switches to use async interface of tokio-postgres directly
without sync wrappers. It opens the possibility to use tokio::select!
between the phsycal_stream.next() and a shutdown channel readiness to
interrupt replication process.
Also this allows to shutdown only particular wal receiver without
using global shutdown_requested flag.
Currently it's included with minimal changes and lives aside of the main
workspace. Later we may re-use and combine common parts with zenith
control_plane.
This change is mostly needed to unify cloud deployment pipeline:
1.1. build compute-tools image
1.2. build compute-node image based on the freshly built compute-tools
2. build zenith image
So we can roll new compute image and new storage required by it to
operate properly. Also it becomes easier to test console against some
specific version of compute-node/-tools.
If safekeepers sync fast enough, callmemaybe thread may never make a call before receiving Unsubscribe request. This leads to the situation, when pageserver lacks data that exists on safekeepers.
Do it separately with SafekeeperPostgresCommand enum as a result. Since query is
always C string, switch postgres_backend process_query argument from Bytes to
&str.
Make passing ztli/ztenant id in safekeeper connection string optional; this is
needed for upcoming intra-safekeeper heartbeat cmd which is not bound to any
timeline.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration respectively.
Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
Introduce builder objects, DeltaLayerWriter and ImageLayerWriter.
This gives more flexibility, as the DeltaLayer::create and
ImageLayer::create functions don't need to know about the details of
the format of where the page versions are coming from. This allows us
to change the format used in InMemoryLayer more easily, without having
to modify Delta- and ImageLayer code.
Also refactor the code in InMemoryLayer::write_to_disk for clarity.
Previously, the 'blknum' argument of various Layer functions was the
block number within the overall relation. That was pretty confusing,
because an individual layer only holds data from a one segment of the
relation. Furthermore, the 'put_truncation' function already dealt
with per-segment size, not overall relation size, adding to the
confusion.
Change the meaning of the 'blknum' argument to mean the block number
within the segment, not the overall relation.
If a commit record contains XIDs that are stored on different CLOG pages,
we duplicate the commit record for each affected CLOG page. In the redo
routine, we must only apply the parts of the record that apply to the
CLOG page being restored. We got that right in the loop that handles the
sub-XIDs, but incorrectly always set the bit that corresponds to the main
XID.
The logic to compute the page number was broken, and as a result, only
the first page of multixact members was updated correctly. All the
rest were left as zeros. Improve test_multixact.py to generate more
multixacts, to cover this case.
Also fix the check that the restored PG data directory matches the
original one. Previously, the test compared the 'pg_new' cluster,
which is a bit silly because the test restored the 'pg_new' cluster
only a few lines earlier, so if the multixact WAL redo is somehow
broken, the comparison will just compare two broken data directories
and report success. Change it to compare the original datadir, the one
where the multixacts were originally created, with a restored image of
the same.
WAL stream uses the 2 connections:
1. Compute node (walproposer) -> Safekeeper (ReceiveWalConn module)
When compute node is shut down, safekeeper needs to stop the respective receiving thread.
Prior to this PR it didn't work because PostgresBackend haven't handled disconnection properly.
2. Safekeeper (ReplicationConn module) -> pageserver (walreceiver thread)
When incoming WAL stream is gone, safekeeper can stop streaming WAL and cancel connection as soon as replica is caught up.
Note that the WAL can be streamed to multiple replicas simultaneously, only disconnect ones that are caught up to the last_recieved_lsn.
The ephemeral files are not usable after restart, so just delete them.
Before this, you got "unrecognized filename in timeline dir" warnings
about them, as Konstantin noted at:
https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870.
While we're at it, refactor away the list_files() function, moving the
logic fully into the caller. Seems more straightforward.
Rename save_decoded_record() to ingest_record(), and move the
responsibility for decoding the record into ingest_record().
Also move the responsibility of updating the CheckPoint relish to
ingest_record(). Put it in a new WalIngest struct, to help with tracking
that.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.
This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.
(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
0.28.0 includes two changes I submitted to upstream:
- Add support for older ListObjects API, needed to use rust-s3 with Google
Cloud Storage: https://github.com/durch/rust-s3/pull/229
- If file is smaller than one chunk, don't initiate multi-part upload.
https://github.com/durch/rust-s3/pull/228
These are not critical for Zenith right now, but let's stay up-to-date.
Currently, we return an all-zeros page if you request a block beyond end of
a relation. That has been implemented in LayeredTimeline::materialize_page,
so that if Layer::get_page_reconstruct_data returns Missing, it returns
and all-zeros page.
However InMemoryLayer and DeltaLayer would return Continue, not Missing,
in that case, and materialize_page would try to find the predecessor
layer. If there was a preceding image layer, then everything would still
work, but if there wasn't, it would return a "could not find predecessor
of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them
check the size of the relation and return Missing in that case.
This is hard to reproduce at the moment, but it happened quickly with
pgbench when I modified InMemoryLayer::write_to_disk so that it didn't
always create a new ImageLayer.
- Don't spawn a separate thread for each connection.
Instead use one thread per safekeeper, that iterates over all connections and sends callback requests for them.
-Use tokio postgres to connect to the pageserver, to avoid spawning a new thread for each connection.
callmemaybe review fixes:
- Spawn all request_callback tasks separately.
- Remember 'last_call_time' and only send request_callback if 'recall_period' has passed.
- If task hasn't finished till next recall, abort it and try again.
- Add pause/resume CallmeEvents to avoid spamming pageserver when connection already established.
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
Should have been a part of cba4da3f4d to provide upgrade for previously
existing clusters. Separates version independent header (magic + version) out of
SafeKeeperState to choose what to deserialize.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).
Full history of the proposer is transferred in separate message before proposer
starts streaming; it is immediately persisted by safekeeper, though he might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.
We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.
Per https://github.com/zenithdb/rfcs/pull/12Closes#296.
Bumps vendor/postgres.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it it. Alternatively, perhaps we should close it
on unload() to save some memory, but this I'm not going to think too
hard about it right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.
This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
This doesnt show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.
To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.
This includes changes from 'inmemory-layer-chunks' branch to serialize /
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.
To get there there was also two cleanups:
* Fix error handling in Execute message handler. This fixes behaviour
when basebackup retured an error. Previously pageserver thread just
died.
* Remove "ancestor" file which previously contained ancestor id and
branch lsn. Currently the same data can be obtained from metadata file.
And just the way we handled ancestor file in the code introduced the
case when branching fails timeline directory is created but there is no data in it
except ancestor file. And this confused gc because it scans
directories. So it is better to just remove ancestor file and clean up
this timeline directory creation so it happens after all validity
checks have passed
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.
This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
Also this adds a check to get_page_at_lsn which asserts that lsn at
which the page is requested was not garbage collected. This check
currently is triggered for readonly nodes which are pinned to specific
lsn and because they are not tracked in pageserver garbage collection
can remove data that still might be referenced. This is a bug and will
be fixed separately.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.
One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.
The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.
Co-authored-by: Patrick Insinger <patrick@zenith.tech>
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.
It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where cross segment boundary.
Fixes https://github.com/zenithdb/zenith/issues/841
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:
1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.
2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.
Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
Now safekeeper control file updated in a following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).
Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).
Also because of the rename machinery switch to use dedicated lock file
to prevent running several safekeepers concurrently on the same data
cleanup
fsync control file after rename to match postgres behaviour
* change zenith-perf-data checkout ref to be main
* set cluster id through secrets so there is no code changes required
when we wipe out clusters on staging
* display full pgbench output on error
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.
Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
tests are based on self-hosted runner which is physically close
to our staging deployment in aws, currently tests consist of
various configurations of pgbenchi runs.
Also these changes rework benchmark fixture by removing globals and
allowing to collect reports with desired metrics and dump them to json
for further analysis. This is also applicable to usual performance tests
which use local zenith binaries.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.
After storing a WAL record in the repository, the WAL receiver checks
if the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.
This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.
Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
This calculation is not that heavy but it is needed only in tests, and
in case the number of tenants/timelines is high the calculation can take
noticeable time.
Resolves https://github.com/zenithdb/zenith/issues/804
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.
The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.
Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.
The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers. Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.
The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.
Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.
We've seen some failures with "Address already in use" errors in the
tests. It's not clear why, perhaps some server processes are not cleaned
up properly after test, or maybe the socket is still in TIME_WAIT state.
In any case, let's make the tests more robust by checking that the port
is free, before trying to use it.
We generate the initial tenantid and store it in the file, so it shouldn't
be missing. But let's cope with it. (This comes handy with the bigger
changes I'm working on at https://github.com/zenithdb/zenith/pull/788)
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication, and no safekeepers. Tests that
want to use safekeepers or authentication set up a custom test-specific
ZenithEnv fixture.
Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
Each test now gets its own test output directory, like
'test_output/test_foobar', even when TEST_SHARED_FIXTURES is used.
When TEST_SHARED_FIXTURES is not used, the zenith repo for each test
is created under a 'repo' subdir inside the test output dir, e.g.
'test_output/test_foobar/repo'
The -D option to specify working directory was broken:
$ mkdir foobar
$ ./target/debug/safekeeper -D foobar
Error: failed to open "foobar/safekeeper.log"
Caused by:
No such file or directory (os error 2)
This was because we both chdir'd into to specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekepeer.log"
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".
We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
* We actually need Python 3.7 because of dataclasses
* Rerun 'pipenv lock' under Python 3.7 and add 'pipenv' to dev deps
* Update docs on developing for Python 3.7
* CircleCI: use Python 3.7 via Docker image instead of Orb
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
This change causes writer halves of a TLS stream to always flush after a
portion of bytes has been written by `std::io::copy`. Furthermore, some
cosmetic and minor functional changes are made to facilitate debug.
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').
The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.
If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.
We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same brach, so we need an identifier
for each node, separate from the branch name.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighen up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
Which is mainly generational state (terms) and useful LSNs.
Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes#726.
ref #115
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
* Use logging in python tests
* Use f-strings for logs
* Don't log test output while running
* Use only pytest logging handler
* Add more info about pytest logging
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
This change brings the following improvements to our build system:
* Now BUILD_TYPE also affects rust apps.
* From now on, cargo will respect `-jN` passed via `make`. However, note
that `rustc` may spawn multiple threads depending on compile flags.
* Cargo is able to cooperate with make to better schedule parallel jobs,
which leads to better build times (-20s in release mode on my machine).
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
Previously, the WAL receiver we would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record has been processed. If it has changed, the checkpoint
relish is updated in the repository. That's somewhat expensive, the
Checkpoint::encode() function is visible in 'perf' profile. Change that
so that we set a flag whenever the Checkpoint struct is modified, so that
we dont need to compare the whole struct anymore.
Previously, typos like `BUILD_TYPE=rlease` would silently
lead to building debug binaries. The current approach is also
more future-proof, since we might add `profile`, `valgrind`
as well as other build types.
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.
Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
When a WAL record affects multiple pages, we currently duplicate the
record for each affected page. That's a bit wasteful, but not too bad
for b-tree splits and non-hot heap updates that affect two pages. But
buffering GiST index build WAL-logs the whole relation in 32 page chunks,
with one giant WAL record for each 32-page chunk. Currently we duplicate
that giant record for each of the 32 pages, which is really wasteful.
Github issue https://github.com/zenithdb/zenith/issues/720 tracks the
problem. This commit adds a test case for it to demonstrate it.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.
This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
The caller is now responsible for lookin up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.
There was a bug in that, as Konstantin noted on discord:
Assume that freeze doesn't create new inmem layer
(maybe_new_open=None). Then we temporary place in historics frozen
layer. Assume that now new put_wal_record request arrives. There is
no open in-mem layer, so it has to create new one. It is looking for
previous layer for read and set it as new in-mem layer
predecessor. But as far as I understand, prev layer should be our
temporary frozen layer. Which will be then removed from
historics.
That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.
This makes two subtle changes:
1. When the first new layer is created on a branch for a segment that
existed on the ancestor branch, the start_lsn of the new layer is now
the branch point + 1. We were previously slightly confused on what
the branch point LSN meant. It means that all the WAL up to and
*including* the LSN on the old branch is visible to the new branch.
If we mark the start LSN of the new layer as equal to the branch point,
that's wrong, because if there is a WAL record with that LSN on the
predecessor layer, the new layer would hide it. This bug was hidden
when the layer on the new branch contained a direct reference to the
layer in the old branch, as get_page_reconstruct_data() followed that
reference directly when it didn't find the page version in the new
layer. But now that the caller performs the lookup, it will look up
the new layer that doesn't contain the record, and you get an error.
2. InMemoryLayer now always stores the segment size at the beginning
of the layer's LSN range. Previously, get_seg_size() might have
recursed into the predecessor layer to get the size, but now we
avoid that by always copying over the last size from the previous
layer, when a new layer is created.
The dockerignore and dockerfile have also been excluded from being moved into
docker images, saving docker layer cache busts if only those are changed.
Commit ca9af37478 removed the import_timeline_wal() call from here.
After that, the info!() message is bogus, as we no longer load the WAL
from local disk. Also, the logical size assertion is pointless now.
All the changes are in the vendor/postgres side. However, because we now
generate fewer Full Page Writes, the 'branch_behind' test needs to be
modified so that it still generates enough WAL to consume a few WAL
segments.
The metadata file is now always 512 bytes. The last 4 bytes are a
crc32c checksum of the previous 508 bytes. Padding zeroes are added
between the serde serialization and the start of the checksum.
A single write call is used, and the file is fsyncd after.
On file creation, the parent directory is fsyncd as well.
It previously took &SafeKeeperState similar to persist(), but only for its
`server` member.
Now it takes &ServerInfo only, so there it's clear the state is not persisted.
Also added a comment about sync.
* Send ProposerGreeting manually in tests
* Move test_sync_safekeepers to test_wal_acceptor.py
* Capture test_sync_safekeepers output
* Add comment for handle_json_ctrl
* Save captured output in CI
* make .dockerignore `ncdu -X` compatible to easily inspect build context
* remove cargo-chef as it was introducing more problems than it was solving
* remove rocksdb packages
* add ca-certs in the resulting image. We need that to be able to make https
connections from container with proxy to the console.
Similar to what commit 7fb7f67b did to 'freeze', dealing with the
dropped segment separately from the rest of the logic makes the code
easier to follow. It is also needed by the next commit that replaces
the code to build new BTreeMap with an iterator; we cannot pass one
of two kinds of closures as argument, it has to always be the same one.
Having separate DeltaLayer::create() calls for the case of dropped
segment and the other cases works around that.
Positive EncryptionResponse should set 'S' byte, not 'Y'. With that
fix it is possible to connect to proxy with SSL enabled and read
deciphered notice text. But after the first query everything stucks.
rsa_private_keys() function returns an empty vector when tries to read
pkcs8-encoded file instead of returning an error. So previous check was
failing on pkcs8. Leave only pkcs8 for now.
Reduces the CPU time spent in the write() syscalls. I noticed that we were
spending a lot of CPU time in libc::write, coming from request_redo(), in
the 'bulk_insert' test. According to some quick profiling with 'perf',
this reduces the CPU time spent in request_redo() from about 30% to 15%.
For some reason, it doesn't reduce the overall runtime of the 'bulk_insert'
test much, maybe by one second if you squint (from about 37s to 36s), so
there must be some other bottleneck, like I/O. But this is surely still
a good idea, just based on the reduced CPU cycles.
Commit message copied below:
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
* Allow LeSer/BeSer impls missing Serialize/Deserialize
Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.
Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.
* Remove unused #[derive(Serialize/Deserialize)]
This should hopefully reduce compile times - if only by a little bit.
Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.
While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.
So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.
Fixes https://github.com/zenithdb/zenith/issues/517.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.
For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).
This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.
Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
Postgres commit message:
PQgetCopyData can sometimes indicate that the copy is done if the
backend returns an error response. So while we still expect that the
walkeeper never sends CopyDone, we can't expect it to never produce
errors.
- Turn dropped layers into non-writeable in get_layer_for_write().
- Handle non-writeable dropped layers in checkpointer. They don't need freezing, so just remove them from list of open_segs and write out to disk.
- Remove code that handles dropped layers in freeze() function. It is not used anymore.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones
anyhow uses the alternate formatting style ("{:#}") to display all of
the causes of an error instead of the outermost context.
Without this, there's less information available to figure out what's
going on. It's probably too much to display in the compute node logs
though, so it's better to leave that formatting as-is.
In this test safekeepers are restarted one by one, while bank transactions
are executed and validated in the background. Bank transactions consist of
balance transfers and log writes. In the end balance sum should remain the
same and there should be progress from every client, when 2 of 3 safekeeper
nodes are up.
It's not interesting for most tests, and clutters the output. If there
are individual tests where it is worthwhole, let's add pg_controldata calls
to those tests, but I don't think it's needed for now.
If the 'latest' flag in the client request is true, the client wants the
latest page version regardless of the LSN in the request. The LSN is just
a hint in that case, indicating that the page hasn't been modified since
since that LSN. The LSN can be very old, so it's possible that the page
server has already garbage collected away the layer at that LSN. We tried
to fetch the old layer and errored out if that happened. To fix, always
fetch the data as of last-record-LSN, if 'latest' is set in the client
request. We now only use the LSN to wait if the requested LSN hasn't been
received and processed yet.
Fixes https://github.com/zenithdb/zenith/issues/567
- Use different message formats for different kinds of response messages.
- Add an Error message, for passing errors from page server to Postgres.
Previously, we would respond to 'exists' request with 'false', and
to 'nblocks' request with 0, if an error happened. Fix those to return
an error message to the client. GetPage requests had a mechanism to
return an error, but it was just a flag with no error message.
- Add a flag to requests, to indicate that we actually want the latest
page version on the timeline, and the LSN is just a hint that we know
that there haven't been any modifications since that LSN. The flag isn't
used for anything yet, but I'm planning to use it to fix
https://github.com/zenithdb/zenith/issues/567
Most of the previous usages of get_repository_for_tenant were followed
by immediately getting a timeline in that repository, without keeping it
around for longer.
The new `get_timeline_for_tenant` function implements that same
behavior, but in one line.
- Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance.
- Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.
Support is done via pytest-xdist plugin.
To use the feature add -n<concurrency> to pytest invocation
e.g. pytest -n8 to run 8 tests in parallel.
Changes in code are mostly about ports assigning. Previously port for
pageserver was hardcoded without the ability to override through zenith
cli and ports for started compute nodes were calculated twice, in zenith
cli and in test code. Now zenith cli supports port arguments for
pageserver and compute nodes to be passed explicitly.
Tests are modified in such a way that each worker gets a non overlapping
port range which can be configured and now contains 100 ports. These
ports are distributed to test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
Data written to frozen layers is lost. It will not appear in on-disk
structures or in successor InMemoryLayers. Here we detect this race, and
fail. I think this race is rare, but this should make it easier to track
down when it happens.
Implement the changes suggested in a comment, create
`get_layer_for_read_locked` so that `get_layer_for_write` doesn't have
to drop the LayerMap lock when searching for the predecessor.
This contains a lowest common denominator of pageserver and safekeeper log
initialisation routines. It uses daemonize flag to decide where to
stream log messages. In case daemonize is true log messages are
forwarded to file. Otherwise streaming to stdout is used. Usage of
stdout for log output is the default in docker side of things, so make
it easier to browse our logs via builtin docker commands.
This job will be responsible for triggering remote CI pipeline in
zenithdb/console repository. That way, we'll always know when
a PR to zenithdb/zenith breaks the cloud console app.
Otherwise restart of safekeeper before the first segment is filled makes it
report 0 as flushed LSN. To this end, tweak find_end_of_wal_segment to allow
starting from given LSN, not only from the start of the segment. While here,
make it less panicky.
Otherwise we produce corrupted record holes in WAL during compute node restart
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- reliably persisted part of WAL.
ref #549.
Mostly by @knizhnik. I adjusted to make sure proposer always starts streaming
since record beginning so we don't need special quirks for decoding in
safekeeper.
Ran into problems launching the WAL redo process on OS X after 4b73ad.
Launching the `initdb` process was met with "bad file descriptor" errors.
Using dtrace, I found shortly after calling `posix_spawn` for `initdb`,
`kevent` was returning this error.
I haven't dug super deep to see if the daemonization itself is the
problem, but this commit fixes it for me. My hunch is that some file
descriptors used when the Tokio runtime is initailzed become invalid
in the daemon process.
When you advance last record LSN, *all* changes up to that LSN should be
imported into repository. We have been a bit sloppy about that when it
comes to the checkpoint information that we also store in the repository.
In WAL receiver, for example, we would receive a WAL record, advance
last record LSN, and only then update the checkpoint relish at the same
LSN. Reorder that so that you advance the last record LSN only after
updating the checkpoint relish. It hasn't apparently caused any problems
so far, but let's be tidy.
Tighten the check for that in get_layer_for_write(), so that it checks for
'lsn > last_record_lsn' rather than 'lsn >= last_record_lsn'.
Always use lsn(0) as the initial last_record_lsn. It is updated soon after
creating the timeline anyway, after loading the bootstrap data, so it
doesn't stay long in that state. I was a bit worried about using a special
value like 0, but it's actually nice that you can distinguish it from any
real LSN value. The unit tests have been using Lsn(0) as the initial start
LSN all along.
Move the responsibility to wait for the WAL to arrive to the callers, and
remove the wait_lsn() calls from the Timeline::get_page_at_lsn() and
friends. We were not totally consistent before, list_rels() was missing the
wait_lsn() call for example.
Closes https://github.com/zenithdb/zenith/issues/521
The other crates in this repository use zenithdb/rust-postgres as a
dependency for the related items, instead of the crates.io versions.
Switching to using that for the proxy as well removes an additional
three dependencies when we compile. (319 -> 316)
by binding sockets before daemonization
also use less annoying error reporting by not printing full error
messages for connect errors in first several connection retries
closes#507
In order to exclude problems with synchronizing disk and memory logical
size is not stored in metadata on disk. It is calculated on timeline
"start" by scanning the contents of layered repo and then size is maintained
via an atomic variable.
This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
- Reorder the structs and functions
- Delegate many of the operations in LayerMap to SegEntry. For example,
`LayerMap::insert_open` now looks up the right SegEntry struct, and
then calls `SegEntry::insert_open` on it.
- Use HashMap::entry() function with or_default() to implement the lookups
with less code
Commit 66929ad6fb added a 'generation' number to open segments stored
in the layer map, to distinguish old layers from layers that were
added to the map during checkpoint processing. But it neglected the
OpenSegEntry::cmp() function.
It seems that the cmp() function is never used by BinaryHeap, so this
didn't cause any user-visible bugs (I tried adding a panic() to the
cmp() function and it didn't fire). But it's clearly wrong and we need
to fix it, anyway.
Compare files in existing compute node's pgdata with fresh basebackup at the same lsn. We expect that content is identical, except tmp files
Use it after some tests.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change few tests to accommodate this new behavior.
In a passing fix two minor issues with basabackup:
* check that we can't create branches with pre-initdb LSN's
* normalize branch LSN's that are pointing to the segment boundary
patch by @knizhnik
closes#506
Now that the page server collects this metric (since commit 212920e47e),
let's include it in the performance test results
The new metric looks like this:
performance/test_perf_pgbench.py . [100%]
--------------- Benchmark results ----------------
test_pgbench.init: 6.784 s
test_pgbench.pageserver_writes: 466 MB <---- THIS IS NEW
test_pgbench.5000_xacts: 8.196 s
test_pgbench.size: 163 MB
=============== 1 passed in 21.00s ===============
This should fix the sporadic regression test failures we've been seeing
lately with "no base img found" errors.
This fixes the common case, but one corner case is still not handled:
If a relation is extended across a segment boundary, leaving a gap block
in the segment preceding the segment containing the target block, the
preceding segment will not be padded with zeros correctly. This adds
a test case for that, but it's commented out.
See github issue https://github.com/zenithdb/zenith/issues/500
To fix, break out of the loop when you reach an in-memory layer that was
created after the checkpoint started. To do that, add a "generation"
counter into the layer map.
Fixes https://github.com/zenithdb/zenith/issues/494
Previously, the InMemoryLayer and DeltaLayer implementations of
get_page_reconstruct_data would recursively call the predecessor layer's
get_page_reconstruct_data function. Refactor so that we iterate in the
caller instead. Make get_page_reconstruct_data() return the predecessor
layer along with the continuation LSN, so that the caller can iterate.
IMO this makes the logic more clear, although this is more lines of code.
DeltaLayer uses the name `predecessor` for the same thing. Use the
same name in InMemoryLayer. The 'img_layer' name was misleading, as
the predecessor layer is not necessarily an image layer. Currently,
the 'freeze' function always creates a new image layer, but it
wouldn't have to be that way. Also, when you create a new branch, at
the branch point the predecessor layer can be a delta layer on the
ancestor branch.
* add lsn argument
* do not expose wait_lsn, wait inside list_nonrels()
* fix parameters parsing
* expose get_last_record_rlsn() to atomically read (last,prev) pair
More work is needed to correctly handle basebackup@old_lsn but current
approach already allows to fix test_restart_compute
There are two main reasons for that:
a) Latest unfinished record may disapper after compute node restart, so let's
try not leak volatile part of the WAL into the repository. Always use
last_valid_record instead.
That change requires different getPage@LSN logic in postgres -- we need
to ask LSN's that point to some complete record instead of GetFlushRecPtr()
that can point in the middle of the record. That was already done by @knizhnik
to deal with the same problem during the work on `postgres --sync-safekeepers`.
Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we
also need to be sure that last_valid_record is aligned.
b) Switch to get_last_record_lsn() in basebackup@no_lsn. When compute node
is running without safekeepers and streams WAL directly
to pageserver it is important to match basebackup LSN and LSN of replication
start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn
and walreceiver started replication with last_record_lsn, which can be less.
So replication was failing since compute node doesn't have requested WAL.
Make this test look like 'test_compute_restart.sh' by @ololobus, which
was surprisingly good for checking safekeepers behavior. This test adds
an intermediate compute node start with bulk select that causes a lot of
FPI's and select itself wouldn't wait for all that WAL to be replicated.
So if we kill compute node right after that we end up with lagging safekeepers
with VCL != flush_lsn. And starting new node from that state takes special
care.
Also, run and print `pg_controldata` output after each compute node start
to eyeball lsn/checkpoint info of basebackup.
This commit only adds test without fixing the problem.
I noticed that the timeline directory contained files like this:
pg_xact_0000_0_000000000169C3C2_00000000016BB399
pg_xact_0000_0_00000000016BB399
pg_xact_0000_0_00000000016BB399_00000000016BDD06
pg_xact_0000_0_00000000016BDD06
pg_xact_0000_0_00000000016BDD06_00000000016C63AA
pg_xact_0000_0_00000000016C63AA
pg_xact_0000_0_00000000016C63AA_0000000001765226_DROPPED
pg_xact_0000_0_0000000001765226
pg_xact_0001_0_00000000016BB77E_00000000016BDD06
pg_xact_0001_0_00000000016BDD06
pg_xact_0001_0_00000000016BDD06_0000000001765226_DROPPED
pg_xact_0001_0_0000000001765226
Note how there is an image file after each DROPPED file. It's a waste of
time and space to materialize an image of the file at the point where it's
dropped, no one is going to request pages on a dropped relation. And it's
a correctness issue too: list_rels() and list_nonrels() will not consider
the relation as unlinked, unless the latest layer indicates so, and there
is no concept of a dropped image layer. That was causing test_clog_truncate
test to fail, when I adjusted the checkpointer to force a checkpoint more
aggressively.
There are a bunch more issues related to dropped rels and branching,
see https://github.com/zenithdb/zenith/issues/502. Hence this doesn't
completely fix the issue I saw with test_clog_truncate either. But it's
a start.
The comment talked about the WAL redo thread, but commit 6e22a8f709
refactored that away. The problem the comment describes probably still
exists, so keep the comment, but update the wording.
Avoid slurping entire image files into memory.
For blocky segments, we write the bytes directly to a bookfile chapter.
The blocks are a fixed size, which allows for random access.
split the page versions into two chapters:
PAGE_VERSION_METAS - a rust BTreeMap from (block #, lsn) -> page & WAL
byte ranges in PAGE_VERSIONS_CHAPTER
PAGE_VERSIONS_CHAPTER - raw page images and serialized WAL records
Once upon a time, 'page_cache.rs' contained an actual page cache, but
it hasn't for a very long time. Rename to reflect what it actually does
these days.
This provides a pytest fixture to record metrics from pytest tests. The
The recorded metrics are printed out at the end of the tests.
As a starter, this includes on small test, using pgbench. It prints out
three metrics: the initialization time, runtime of 5000 xacts, and the
repository size after the tests.
epochStartLsn is the LSN since which new proposer writes its WAL in its epoch,
let's be more explicit here.
truncate_lsn is LSN still needed by the most lagging safekeeper. restart_lsn is
terminology from pg_replicaton_slots, but here we don't really have 'restart';
hopefully truncate word makes it clearer.
1) Extract consensus logic to safekeeper.rs.
2) Change the voting flow so that acceptor tells his epoch along with giving
the vote, not before it; otherwise it might get immediately stale. #294
3) Process messages from compute atomically and sync state properly. #270
4) Use separate structs for disk and network.
ref #315
Previously, a SnapshotLayer and corresponding file on disk contained the
base image of every page in the segment at the start LSN, and all the
changes (= WAL records) in the range between start and end LSN. That was
a bit awkward, because we had to keep the base image of every page in
memory until we had accumulated enough WAL after the base image to write
out the layer. When it's time to write out a layer, we would really want
to replay the WAL to reconstruct the most recent version of each page, to
save the effort later. That's on the assumption that the client will
usually request the most recent version, not some older one.
Split the SnapshotLayer into two structs: ImageLayer and DeltaLayer. An
image layer contains a "snapshot" of the segment at one specific LSN, and
no WAL records, whereas a delta layer contains WAL records in a range of
LSNs. In order to reconstruct a page version in the delta layer, by
performing WAL redo, you also need the previous image layer. So the delta
layers are "incremental" against the previous layer.
So where previously we would create snapshot files like this:
rel_100_200
rel_200_300
rel_300_400
We now create image and delta files like this:
rel_100 # image
rel_100_200 # delta
rel_200
rel_200_300
rel_300
rel_300_400
rel_400
That's more files, but as discussed above, this allows storing more
up-to-date page versions on disk, which should reduce the latency of
responding to a GetPage request. It also allows more fine-grained garbage
collection. In the above example, after the old page version are no longer
needed and if the relation is not modified anymore, we only need to keep
the latest image file, 'rel_400', and everything else can be removed.
Implements https://github.com/zenithdb/zenith/issues/339
Now that we only have one Repository implementation, no need for the
command-line options to choose it either. I'm removing these as a separate
commit to show what we will need to do if we add another Repository
implementation in the future (even though I don't foresee us doing that
any time soon)
The layered storage format is good enough that we don't need the rocksdb
implementation anymore. There are a lot of known issues but we'll keep
working on them.
Now that the new storage format is based on immutable files, we want to
implement push/pull in terms of these immutable files as well. Similarly
to how those files will be transferred between S3 and the page server.
The implementation we had was fairly tightly coupled with the object
repository implementation, but I'm about to remove the object / rocksdb
storage format soon. That would leave the current "zenith push" command
completely broken.
It seemed like a good idea at the time, but in hindsight, it was premature
to implement push/pull yet. It's a nice feature and I'd like to see it
reimplemented in the future, but in the meanwhile, let's remove the code
we had. We can dig the parts of it that might be useful in the future
from the git history.
The old policy was to flush all in-memory layers to disk every 10 seconds.
That was a pretty dumb policy, unnecessarily aggressive. This commit
changes the policy so that we only flush layers where the oldest WAL
record is older than 16 MB from the last valid LSN on the timeline. That's
still pretty aggressive, but it's a step in the right direction. We do
need a limit on how old the oldest in-memory layer is allowed to be,
because that determines how much WAL the safekeepers need to hold onto,
and how much WAL we need to reprocess in case of a page server crash.
16 MB is surely still too aggressive for that, but it's easy to change
the setting later.
To support that, keep all in-memory layers in a binary heap, so that we
can easily find the one with the oldest LSN.
This tracks and a new LSN value in the metadata file: 'disk_consistent_lsn'.
Before, on page server restart we restarted the WAL processing from the
'last_record_lsn' value, but now that we don't flush everything to disk in
one go, the 'last_record_lsn' tracked in memory is usually ahead of the
last record that's been flushed to disk. Even though we track that oldest
LSN now, the crash recovery story isn't really complete. We don't do
fsync()s anywhere, and thing will break if a snapshot file isn't complete,
as there's no CRC on them. That's not new, and it's a TODO.
Upgrade to bindgen 0.59, which has two new abilities:
- specify arbitrary #[derive] attributes to attach to generated structs
- request explicit padding fields
These two features are enough to replace transmute with serde/bincode.
Because the t_cid field was missing from the XlHeapDelete struct that
corresponds to the PostgreSQL xl_heap_delete struct, the check for the
XLH_DELETE_ALL_VISIBLE_CLEARED flag did not work correctly.
Decoding XlHeapUpdate struct was also missing the t_cid field, but that
didn't cause any immediate problems because in that struct, the t_cid
field is after all the fields that the page server cares about. But fix
that too, as it was an accident waiting to happen.
The bug was mostly hidden by the VM page handling in zenith_wallog_page,
where it forcibly generates a FPW record whenever a VM page is evicted:
else if (forknum == VISIBILITYMAP_FORKNUM && !RecoveryInProgress())
{
/*
* Always WAL-log vm.
* We should never miss clearing visibility map bits.
*
* TODO Is it too bad for performance?
* Hopefully we do not evict actively used vm too often.
*/
XLogRecPtr recptr;
recptr = log_newpage_copy(&reln->smgr_rnode.node, forknum, blocknum, buffer, false);
XLogFlush(recptr);
lsn = recptr;
But that was just hiding the issue: it's still visible if you had a
read-only node relying on the data in the page server, or you killed and
restarted the primary node, or you started a branch. In the included test
case, I used a new branch to expose this.
Fixes https://github.com/zenithdb/zenith/issues/461
- Move source tree overview into separate docs/sourcetree.md and update it.
- Add glossary: docs/glossary.md
- Add a draft of Architecture overview to main Readme.md
There can be only one "open" layer for each segment. That's the last one,
implemented by InMemoryLayer. That's the only one where new records can
be appended to. Much of the code needed to distinguish between the last
open layer and other layers anyway, so make the distinction explicit
in LayerMap.
There was a a lot of duplicated code between the get_page_at_lsn()
implementations in InMemoryLayer and SnapshotLayer. Move the code for
requesting WAL redo from the Layer trait into LayeredTimeline. The
get-function in Layer now just returns the WAL records and base image
to the caller, and the caller is responsible for performing the WAL
redo on them.
Split each relish into fixed-sized 10 MB segments. Separate layers are
created for each segment. This reduces the write amplification if you
have a large relation and update only parts of it; the downside is
that you have a lot more files. The 10 MB is just a guess, we should
do some modeling and testing in the future to figure out the optimal
size.
Each segment tracks the size of the segment separately. To figure out
the total size of a relish, you need to loop through the segment to
find the highest segment that's in use. That's a bit inefficient, but
will do for now. We might want to add a cache or something later.
Change CLI so that we always create node from scratch at 'pg start'.
This operation preserve previously existing config
Add new flag '--config-only' to 'pg create'.
If this flag is passed, don't perform basebackup, just fill initial postgresql.conf for the node.
Track the time spent on replaying WAL records by the special Postgres
process, the time spent waiting for acces to the Postgres process (since
there is only one per tenant), and the number of records replayed.
This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.
This make the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>"
The unit tests have been refactored to exercise both implementations.
'layered' is now the default.
Push/pull is not implemented. The 'test_history_inmemory' test has been
commented out accordingly. It's not clear how we will implement that
functionality; probably by copying the snapshot files directly.
Most of the work here was done on the postgres side. There's more
information in the commit message there.
(see: 04cfa326a5)
On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to
initialize the WAL keeper protocol. Everything else is mostly the same,
with the only real difference being that protocol messages are now
discrete CopyData messages sent over the postgres protocol.
For the sake of documentation, the full set of these messages is:
<- recv: START_WAL_PUSH query
<- recv: server info from postgres (type `ServerInfo`)
-> send: walkeeper info (type `SafeKeeperInfo`)
<- recv: vote info (type `RequestVote`)
if node id mismatch:
-> send: self node id (type `NodeId`); exit
-> send: confirm vote (with node id) (type `NodeId`)
loop:
<- recv: info and maybe WAL block (type `SafeKeeperRequest` + bytes)
(break loop if done)
-> send: confirm receipt (type `SafeKeeperResponse`)
My main motivation is to make it easier to attribute time spent in WAL
redo to the request that needed the WAL redo. With this patch, the WAL
redo is performed by the requester thread, so it shows up in stack traces
and in 'perf' report as part of the requester's call stack. This is also
slightly simpler (less lines of code) and should be a bit faster too.
The upcoming layered storage implementation handles GC as a
repository-wide operation because it needs to pay attention to the branch
points of all timelines.
On my laptop, the server was receiving the token as a string with extra
b'...' escaping, e.g as "b'eyJ0....0ifQA'" instead of just "eyJ0....0ifQA".
That was causing the test to fail.
I'm using Python 3.9, while the CI is using Python 3.8. I suspect that's
why. My version of pyjwt might be different too.
See also https://github.com/jpadilla/pyjwt/issues/391.
They represent files and use RelationSizeEntry to track existing and dropped files.
They can be both blocky and non-blocky.
get_relish_size() and get_rel_exists() functions work with physical relishes, not only with blocky ones.
Follow PostgreSQL logic: remove Twophase files when prepared transaction is committed/aborted.
Always store Twophase segments as materialized page images (no wal records).
Current state with authentication.
Page server validates JWT token passed as a password during connection
phase and later when performing an action such as create branch tenant
parameter of an operation is validated to match one submitted in token.
To allow access from console there is dedicated scope: PageServerApi,
this scope allows access to all tenants. See code for access validation in:
PageServerHandler::check_permission.
Because we are in progress of refactoring of communication layer
involving wal proposer protocol, and safekeeper<->pageserver. Safekeeper
now doesn’t check token passed from compute, and uses “hardcoded” token
passed via environment variable to communicate with pageserver.
Compute postgres now takes token from environment variable and passes it
as a password field in pageserver connection. It is not passed through
settings because then user will be able to retrieve it using pg_settings
or SHOW ..
I’ve added basic test in test_auth.py. Probably after we add
authentication to remaining network paths we should enable it by default
and switch all existing tests to use it.
Server functionality requires not only the "server" feature flag, but
also either "http1" or "http2" (or both). To make things simpler
(and prevent analogous problems), enable all features.
Current GC test is flaky and overly strict. Since we are migrating to the layered repo format
with different GC implementation let's just silence this test for now.
This patch has been extracted from #348, where it became unnecessary
after we had decided that we didn't want to measure anything inside
PostgresBackend.
IMO the change is good enough to make its way into the codebase,
even though it brings nothing "new" to the code.
The metrics are served by an http endpoint, which
is meant to be spawned in a new thread.
In the future the endpoint will provide more APIs,
but for the time being, we won't bother with proper routing.
This clarifies - I hope - the abstractions between Repository and
ObjectRepository. The ObjectTag struct was a mix of objects that could
be accessed directly through the public Timeline interface, and also
objects that were created and used internally by the ObjectRepository
implementation and not supposed to be accessed directly by the
callers. With the RelishTag separaate from ObjectTag, the distinction
is more clear: RelishTag is used in the public interface, and
ObjectTag is used internally between object_repository.rs and
object_store.rs, and it contains the internal metadata object types.
One awkward thing with the ObjectTag struct was that the Repository
implementation had to distinguish between ObjectTags for relations,
and track the size of the relation, while others were used to store
"blobs". With the RelishTags, some relishes are considered
"non-blocky", and the Repository implementation is expected to track
their sizes, while others are stored as blobs. I'm not 100% happy with
how RelishTag captures that either: it just knows that some relish
kinds are blocky and some non-blocky, and there's an is_block()
function to check that. But this does enable size-tracking for SLRUs,
allowing us to treat them more like relations.
This changes the way SLRUs are stored in the repository. Each SLRU
segment, e.g. "pg_clog/0000", "pg_clog/0001", are now handled as a
separate relish. This removes the need for the SLRU-specific
put_slru_truncate() function in the Timeline trait. SLRU truncation is
now handled by caling put_unlink() on the segment. This is more in
line with how PostgreSQL stores SLRUs and handles their trunction.
The SLRUs are "blocky", so they are accessed one 8k page at a time,
and repository tracks their size. I considered an alternative design
where we would treat each SLRU segment as non-blocky, and just store
the whole file as one blob. Each SLRU segment is up to 256 kB in size,
which isn't that large, so that might've worked fine, too. One reason
I didn't do that is that it seems better to have the WAL redo
routines be as close as possible to the PostgreSQL routines. It
doesn't matter much in the repository, though; we have to track the
size for relations anyway, so there's not much difference in whether
we also do it for SLRUs.
While working on this, I noticed that the CLOG and MultiXact redo code
did not handle wraparound correctly. We need to fix that, but for now,
I just commented them out with a FIXME comment.
The codepath for tenant_create command first launched the WAL redo
thread, and then called branches::create_repo() which checked if the
tenant's directory already exists. That's problematic, because
launching the WAL redo thread will run initdb if the directory doesn't
already exist. Race condition: If the tenant already exists, it will
have a WAL redo thread already running, and the old and new WAL redo
thread might try to run initdb at the same time, causing all kinds of
weird failures.
The test_pageserver_api test was failing 100% repeatably on my laptop
because of this. I'm not sure why this doesn't occur on the CI:
Jul 31 18:05:48.877 INFO running initdb in "./tenants/5227e4eb90894775ac6b8a8c76f24b2e/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:483
thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "heikki".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory ./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Europe/Helsinki
creating configuration files ... ok
running bootstrap script ...
stderr:
2021-07-31 15:05:48.875 GMT [282569] LOG: could not open configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf": No such file or directory
2021-07-31 15:05:48.875 GMT [282569] FATAL: configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf" contains errors
child process exited with exit code 1
initdb: removing data directory "./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir"
- Add new subdir postgres_ffi/samples/ for config file samples.
- Don't copy wal to the new branch on zenith init or zenith branch.
- Import_timeline_wal on zenith init.
It was pretty cool, but no one used it, and it had gotten badly out of
date. The main interesting thing with it was to see some basic metrics
on the fly, while the page server is running, but the metrics collection
had been broken for a long time, too. Best to just remove it.
this patch adds support for tenants. This touches mostly pageserver.
Directory layout on disk is changed to contain new layer of indirection.
Now path to particular repository has the following structure: <pageserver workdir>/tenants/<tenant
id>. Tenant id has the same format as timeline id. Tenant id is included in
pageserver commands when needed. Also new commands are available in
pageserver: tenant_list, tenant_create. This is also reflected CLI.
During init default tenant is created and it's id is saved in CLI config,
so following commands can use it without extra options. Tenant id is also included in
compute postgres configuration, so it can be passed via ServerInfo to
safekeeper and in connection string to pageserver.
For more info see docs/multitenancy.md.
It used to be the case that walkeeper's background thread
failed to recognize the end of stream (EOF) signaled by the
`Ok(None)` result of `FeMessage::read`.
* Introducing common enum ObjectVal for all values
* Rewrite push mechanism to use raw object copy
* Fix history unit test
* Add skip_nonrel_objects functions for history unit tests
It removes remaining issues with running cargo audit. There was one
error and one warning:
Crate: tokio
Version: 1.5.0
Title: Task dropped in wrong thread when aborting `LocalSet` task
Date: 2021-07-07
ID: RUSTSEC-2021-0072
URL: https://rustsec.org/advisories/RUSTSEC-2021-0072
Solution: Upgrade to >=1.5.1, <1.6.0 OR >=1.6.3, <1.7.0 OR >=1.7.2, <1.8.0 OR >=1.8.1
Crate: cpuid-bool
Version: 0.1.2
Warning: unmaintained
Title: `cpuid-bool` has been renamed to `cpufeatures`
Date: 2021-05-06
ID: RUSTSEC-2021-0064
URL: https://rustsec.org/advisories/RUSTSEC-2021-0064
When executed, pipenv shell creates a fresh Pipfile if none
is found in the current directory. This is confusing,
hence the patch to symlink it at the top level, which
is a good starting point for various commands.
Based on Konstantin's original patch (PR #275), but I introduced helper
functions for serializing/deserializing the different kinds of
ObjectValues, which made it more pleasant to use, as the deserialization
checks are now performed in the helper functions.
Add back code to parse transaction commit and abort records, and in
particular the list of dropped relations in them. Add 'put_unlink'
function to the Timeline trait and implementation. We had the code to
handle dropped relations in the GC code and elsewhere in ObjectRepository
already, but there was nothing to create the RelationSizeEntry::Unlink
tombstone entries until now. Also add a test to check that GC correctly
removes all page versions of a dropped relation.
Implements https://github.com/zenithdb/zenith/issues/232, except for the
"orphaned" rels.
Reviewed-by: Konstantin Knizhnik
Some of these were related to handling various WAL records that are not
related to any relations, like pg_multixact updates. These should have
been removed in the revert commit 6a9c036ac1, but I missed them.
Also, we didn't anything with commit/abort records. We will start
parsing commit/abort records in the next commit, but seems better to
add that from clean slate.
Reviewed-by: Konstantin Knizhnik
Without this step, the page versions won't actually be removed, they're
just marked for deletion on the next RocksDB "merge" or "compact"
operation.
Author: Konstantin Knizhnik
- Print the number of dropped relations, and the number of relations
encountered overall.
- If a block has only one page version, the latest one, don't count it as
a "truncated" version history. Only count pages for which we actually
removed some old versions.
- Change "last" to "latest" in variable names and comments. "Last" could
be interpreted as "oldest", but here it means "newest".
- Add a comment noting that the GC code depends on get_page_at_lsn_nowait
to store the materialized page version in the repository.
- Change "last" to "latest" in variable names for clarity. "Last" could
be interpreted as the oldest, but here it means newest.
This patch aims to:
* Unify connection & querying logic of ZenithPagerserver and Postgres.
* Mitigate changes to transaction machinery introduced in `psycopg2 >= 2.9`.
Now it's possible to acquire db connection using the corresponding
method:
```python
pg = postgres.create_start('main')
conn = pg.connect()
...
conn.close()
```
This pattern can be further improved with the help of `closing`:
```python
from contextlib import closing
pg = postgres.create_start('main')
with closing(pg.connect()) as conn:
...
```
All connections produced by this method will have autocommit
enabled by default.
The clippy maintainers have not provided an easy way for projects to
configure the set of lints they would like enabled/disabled. It's
particularly bad for projects using workspaces, which can easily lead to
duplicated clippy annotations for every crate, library, binary, etc.
Add a shell script that runs clippy, with a few unhelpful lints
disabled:
new_without_default
manual_range_contains
comparison_chain
If you save this in your path under the name "cargo-zclippy" (or
whatever name you like), then you can run it as "cargo zclippy" from the
shell prompt. If your text editor has rust-analyzer integration, you can
also use this new command as a replacement for "cargo check" or "cargo
clippy" and see clippy warnings and errors right in the editor.
To simplify cloud ops, allow configuration via file.
toml is used as the config format, and the file is stored in the working
directory.
Arguments used at initialization are saved in the config file.
Config file params may be overridden by CLI arguments.
Use CLI args instead of environment variables to parameterize the
working directory and postgres distirbution.
Before this change, there was a mixture of environment variables and CLI
arguments that needed to be set. Moving to a single input simplifies
cloud configuration management.
The bool::then function was added in Rust 1.50. I'm still using 1.48 on
my laptop. We haven't decided what Rust version we will require
(https://github.com/zenithdb/zenith/issues/138), and I'll probably need
to upgrade sooner or later, but this will do for now.
It's not realistic to enable full-blown type checks
within test_runner's codebase, since the amount of
warnings revealed by mypy is overwhelming.
Tests are supposed to be easy to use, so we can't
cripple everybody's workflow for the sake of imaginary benefit.
Ultimately, the purpose of this attempt is three-fold:
* Facilitate code navigation when paired with python-language-server.
* Make method signatures apparent to a fellow programmer.
* Occasionally catch some obvious type errors.
Clear a clippy warning about manual flatten.
This isn't good error handling, but panicking is probably better than
spinning forever if stdin returns EOF.
Previously, transaction commit could happen regardless of whether
pageserver has caught up or not. This patch aims to fix that.
There are two notable changes:
1. ComputeControlPlane::new_node() now sets the
`synchronous_standby_names = 'pageserver'` parameter to delay
transaction commit until pageserver acting as a standby has
fetched and ack'd a relevant portion of WAL.
2. pageserver now has to:
- Specify the `application_name = pageserver` which matches the
one in `synchronous_standby_names`.
- Properly reply with the ack'd LSNs.
This means that some tests don't need sleeps anymore.
TODO: We should probably make this behavior configurable.
Fixes#187.
Now postgres_backend communicates with the client, passing queries to the
provided handler; we have two currently, for wal_acceptor and pageserver.
Now BytesMut is again used for writing data to avoid manual message length
calculation.
ref #118
Note the unsafety of the unsafe block, with a link to the ongoing
discussion. This doesn't try to solve the problem, but let's at least
document the status quo.
That is mandatory to correctly maintain visibility map (see issue#192).
It also makes sense to check that wal_log_hints is enabled at the pageserver side,
but for now let just check that tests will pass with this on.
- All timelines are now stored in the same rocksdb repository. The GET
functions have been taught to follow the ancestors.
- Change the way relation size is stored. Instead of inserting "tombstone"
entries for blocks that are truncated away, store relation size as
separate key-value entry for each relation
- Add an abstraction for the key-value store: ObjectStore. It allows
swapping RocksDB with some other key-value store easily. Perhaps we
will write our own storage implementation using that interface, or
perhaps we'll need a different abstraction, but this is a small
improvement over status quo in any case.
- Garbage Collection is broken and commented out. It's not clear where and
how it should be implemented.
There was no guarantee that the SELECT FOR KEY SHARE queries actually
run in parallel. With unlucky timing, one query might finish before
the next one starts, so that the server doesn't need to create a
multixact. I got a failure like that on the CI:
batch_others/test_multixact.py:56: in test_multixact
assert(int(next_multixact_id) > int(next_multixact_id_old))
E AssertionError: assert 1 > 1
E + where 1 = int('1')
E + and 1 = int('1')
This could be reproduced by adding a random sleep in the runQuery
function, to make each query run at different times.
To fix, keep the transactions open after running the queries, so that
they will surely be open concurrently. With that, we can run the
queries serially, and don't need the 'multiprocessing' module anymore.
Fixes https://github.com/zenithdb/zenith/issues/196
Move `save_decoded_record` out of the Timeline trait. The storage
implementation shouldn't need to know how to decode records.
Also move put_create_database() out of the Timeline trait. Add a new
`list_rels` function to Timeline to support it, instead.
Rename `get_relsize` to `get_rel_size`, and `get_relsize_exists` to
`get_rel_exists`. Seems nicer.
Add build dependencies and other local packages needed (Ubuntu only).
Fix some weird formatting of psql commands due to `sh` syntax
highlighting.
Improve test directions, so pytest doesn't scan the whole tree.
Drop description of the integration_tests directory since it's on its
way out.
Derive Serialize+Deserialize for RelTag, BufferTag, CacheKey. Replace
handwritten pack/unpack functions with ser, des from
zenith_utils::bin_ser (which uses the bincode crate).
There are some ugly hybrids in walredo.rs, but those functions are
already doing a lot of questionable manual byte-twiddling, so hopefully
the weirdness will go away when we get better postgres protocol
wrappers.
- Previously, we checked on first use of a timeline, whether there is
a snapshot and WAL for the timeline, and loaded it all into the
(rocksdb) repository. That's a waste of effort if we had done that
earlier already, and stopped and restarted the server. Track the
last LSN that we have loaded into the repository, and only load the
recent missing WAL after that.
- When you create a new zenith repository with "zenith init",
immediately load the initial empty postgres cluster into the rocksdb
repository. Previously, we only did that on the first connection. This
way, we don't need any "load from filesystem" codepath during normal
operation, we can assume that the repository for a timeline is always
up to date. (We might still want to use the functionality to import an
existing PostgreSQL data directory into the repository in the future,
as a separate Import feature, but not today.)
This includes the following commits:
35a1c3d521 Specify right LSN in test_createdb.py
d95e1da742 Fix issue with propagation of CREATE DATABASE to the branch
8465738aa5 [refer #167] Fix handling of pg_filenode.map files in page server
86056abd0e Fix merge conflict: set initial WAL position to second segment because of pg_resetwal
2bf2dd1d88 Add nonrelfile_utils.rs file
20b6279beb Fix restoring non-relational data during compute node startup
06f96f9600 Do not transfer WAL to computation nodes: use pg_resetwal for node startup
As well as some older changes related to storing CLOG and MultiXact data as
"pseudorelation" in the page server.
With this revert, we go back to the situtation that when you create a
new compute node, we ship *all* the WAL from the beginning of time to
the compute node. Obviously we need a better solution, like the code
that this reverts. But per discussion with Konstantin and Stas, this
stuff was still half-baked, and it's better for it to live in a branch
for now, until it's more complete and has gone through some review.
For a CI build, storing incremental build data just makes the cache
bigger, for minimal gain.
Also, for Rust < 1.52.1 there are incremental compilation bugs. CircleCI
is currently building on 1.51.
This only affects the debug build; incremental compilation isn't used on
the release build.
Add parameters to specify which kind of build; run a debug and release
variant for each job.
Eventually this will be too many jobs, but for now this is a nice start.
Also, bump the cache string to "v02" so we don't mix up our cache output
with other branches.
Parse all the command line options before calling "zenith init" and
changing current working dir. The rest of the options don't make any
difference if we're initializing a new repository, but it seems strange
and error-prone to parse some arguments at different times.
* Add ancestor_id to pg_list->branch_list output of pageserver.
* Display branching point (LSN) for each non-root branch.
* Add tests for `zenith branch`.
It's created once early in server startup, after parsing the
command-line options, and never modified afterwards. To simplify
things, pass it around as static ref, instead of making copies in all
the different structs. We still pass around a reference to it, rather
than putting it in a global variable, to allow unit testing with
different configs in the same process.
Move ReceiveWalConn into its own file. Shuffle constants around so they
are close to the protocol they're associated with, or move them into
postgres_ffi if they seem to be global constants.
It may be more robust to use the TcpStream::peek function, so do all
protocol peeking before creating the protocol object. This reveals the
next cleanup step: rename Connection, since it's no longer the parent of
SendWalConn. Now we peek at the first bytes and choose which kind of
connection object to create.
We're starting to deserialize directly from the TcpStream now, which
means that a socket error gets logged as "deserialize error". That's not
very helpful; preserve the io::Error so it can be logged.
Serialize objects directly to the stream. This allows us to remove a
bunch of buffer management code, along with the NewSerializer trait that
was a temporary bridge between the old code and the new.
The pieces are:
base Connection
SendWal
ReplicationHandler
There are lots of other changes here:
- Put the replication reader in a background thread; this gets rid
of some hacks with nonblocking mode.
- Stop manually buffering input data; use BufReader instead.
- Use BytesMut a lot less; use Read/Write traits where possible.
If we try to read a few bytes at a time, we will perform a lot more
syscalls than necessary. Wrap the socket in a BufReader, which will
buffer bytes as needed.
libpq tolerates and ignores them, but the Rust postgres client gets
confused by them in certain states. This explained the strange failure
I saw with the Copy Out protocol. I'm not sure what the condition was
exactly, but somehow the rust client got confused if it received a
ReadyForQuery message that it was not expecting.
Fixes https://github.com/zenithdb/zenith/issues/148.
Compiling a Regex is very expensive, so let's not do it on every
invocation. This was consuming a big fraction of the time in creating
a new base backup at "zenith pg create". This commits brings down the
time to run "zenith pg create" on a freshly created repository from
about 2 seconds to 1 second.
It's not worth spending much effort on optimizing things at this stage
in general, but might as well pick low-hanging fruit like this.
I'm going nuts with the pattern:
let k = iter.key().unwrap();
buf.clear();
buf.extend_from_slice(&k);
let key = CacheKey::unpack(&mut buf);
Introduce helper functions to convert a CacheKey into BytesMut, and
from [u8] into CacheKey. Reduces the boilerplate code a lot.
The helper functions create a new BytesMut on each call, whereas the old
coding could reuse a single BytesMut, so this could be a bit slower. I
haven't tried measuring it, but at least it's not immediately noticeable,
and readability is much more imporatant at this point. We can optimize
later
This isn't very exciting with the current RocksDB implementation, because
it doesn't care about the PostgreSQL 1 GB segment boundaries at all.
But I think we will care about this in the future, and more tests is
generally better anyway.
Commit 746f667311 added the 'workdir' field and the get_*_path()
functions, with the idea that we cd into the directory at page server
startup, so that the get_*_path() functions can always return paths
relative to '.', but 'workdir' shows the original path to it. Change it
so that 'conf.workdir' is always set to '.', too, and the get_*_path()
functions include 'workdir' in the returned paths. Why? Because that
allows writing unit tests without changing the current directory.
When I was working on commit 97992226d3, I initially wrote the test so
that it changed the current working directory, just like commit 746f667311
did. But that was problematic, when I tried to add another unit test that
*also* wants to change the current working dir, because they could then
not run concurrently. In fact, they could not even run serially, unless
the current directory was carefully reset after the test. So it is better
to avoid changing the current directory in tests.
Commit 746f667311 moved the "chdir" earlier in the startup sequence,
before daemonizing. But it forgot to remove a corresponding chdir call
later in the sequence when not in daemonize mode. As a result, if you
tried to start the pageserver without the --daemonize option, it always
failed with "No such file or directory" error.
We have coverage for these things in the python tests, we don't need both.
test_redo_cases() was a pretty simple case that created a couple of
table and inserted to them. We don't have another test exactly like
that, but there is enough similar stuff in the test_branch_behind and
test_pgbench tests to cover it.
test_regress() and pgbench() are redundant with the test_pg_regress and
test_pgbench python tests.
test_pageserver_two_timelines() is similar enough to the test_branch_behind
test that we don't need it. And many other tests create branches, too.
In 746f667 I "optimized" wal_acceptor tests by setting "--pageserver"
flag only on one of wal_acceptors. Which obviously will hang the system if
that wal_acceptors is down. And test_acceptors_restarts does exctly this.
Set "--pageserver" on all wal_acceptors as it was before.
- The 'pageserver' fixture now sets up the repository and starts up
the Page Server automatically. In other words, the 'pageserver'
fixture provides a Page Server that's up and running and ready to
use in tests.
- The 'pageserver' fixture now also creates a branch called 'empty',
right after initializing the repository. By convention, all the
tests start by createing a new branch off 'empty' for the test. This
allows running all the tests against the same Page Server
concurrently. (I haven't tested that though. pytest doensn't
provide an option to run tests in parallel but there are extensions
for that.)
- Remove the 'zen_simple' fixture. Now that 'pageserver' provides
server that's up and running, it's pretty simple to use the
'pageserver' and 'postgres' fixtures directly.
- Don't assume host name or ports in the tests. They now use the
fields in the fixtures for that. That allows assigning the ports
dynamically, making it possible to run multiple page servers in
parallel, or running the tests in parallel with another page
server. This commit still hard codes the Page Server's port in the
fixture, though, so more work is needed to actually make it
possible.
- I made some changes to the 'postgres' fixture in commit 532918e13d,
which broke the other tests. Fix them.
- Divide the tests into two "batches" of roughly equal runtime, which
can be run in parallel
- Merge the 'test_file' and 'test_filter' options in CircleCI config
into one 'test_selection' option, for simplicity.
This patch started as an effort to support CLI working against remote
pageserver, but turned into a pretty big refactoring.
* CLI now does not look into repository files directly. New commands
'branch_create' and 'identify_system' were introduced into page_service to
support that.
* Branch management that was scattered between local_env and
zenith/main.rs is moved into pageserver/branches.rs. That code could better fit
in Repository/Timeline impl, but I'll leave that for a different patch.
* All tests-related code from local_env went into integration_tests/src/lib.rs as an
extension to PostgresNode trait.
* Paths-generating functions were concentrated around corresponding config
types (LocalEnv and PageserverConf).
When creating a new branch, we copied all WAL from the source timeline
to the new one, and it was being picked up and digested into the
repository on first use of the timeline. Fix by copying the WAL only
up to the branch's starting point.
We should probably move the branch-creation code from the CLI to page
server itself - that's what I was starting to hack on when I noticed this
bug - but let's fix this first.
Add a regression test. To test multiple branches, enhance the python
test fixture to manage multiple running Postgres instances. Also, for
convenience, add a function to the postgres fixture to open a connection
to the server with psycopg2.
This isn't just cosmetic, this also fixes one bug: the code in
parse_point_in_time() function used str::parse::<u64>() to parse the
parts of the LSN string (e.g. 0/1A2B3C4D). That's wrong, because the
LSN consists of hex digits, not base-10.
Commit 9ece1e863d used `slice.fill`, which isn't available until Rust
v1.50.0. I have 1.48.0 installed, so it was failing to compile for me.
We haven't really standardized on any particular Rust version, and if
there's a good feature we need in a recent version, let's bump up the
minimum requirement. But this is simple enough to work around.
Turn WalRedoManager into an abstract trait, so that it can be easily
mocked in unit tests.
One change here is that the WAL redo manager is no longer tied to a
specific zenith timeline. It didn't do anything with that information
aside from using it in the dummy datadir's name. We could use any
random string for that purpose, it's just to prevent two WAL redo
managers from stepping over each other. But this commit actually
changes things so that all timelines use the same WAL redo manager, so
that's not necessary. We will probably want to maintain a pool of WAL
redo processes in the future, but for now let's keep it simple.
In the passing, fix some comments.
We used to create them under .zenith/.zenith/<timelineid>. The double
.zenith was clearly not intentional. Change it to
.zenith/timelines/<timelineid>.
Fixes https://github.com/zenithdb/zenith/issues/127
It's a bit annoying that the .zenith state can show up in multiple
places, but since this is how the regression tests run if you launch
them from the git root directory, ignore this one too.
The module comment should use "//!" instead of "///". Otherwise, it is
considered to apply to the *next* thing, in this case the "use" statement
that follows, not the file as whole. "cargo fmt" revealed this by insisting
to move the "use crate::pg_constants" line to before the comment.
Multiple fds writing to the same file doesn't work. One fd will
overwrite the output of the other fd. We were opening log files three
times (stdout, stderr, and slog).
The symptoms can be seen when the program panics; the final file will
have truncated or lost messages. After this change, all messages are
preserved. If panicking and logging are concurrent (and they definitely
can be), some of the messages may be interleaved in slightly
inconvenient ways.
File::try_clone() is essentially `dup` underneath, meaning the two will
share the same file offset.
If timeline doesn't have a valid "last valid LSN", refuse WAL streaming.
The previous behavior was to start streaming from the very beginning of
time. That was needed to support bootstrapping the page server with no
data at all (see commit bd606ab37a), but we no longer do that.
Somehow I never learned this part correctly: relative imports use the
syntax "import .file" for a file sitting in the same directory.
This error wasn't terribly obvious, but the Pylance linter is yelling at
me so I'll fix it now before anyone else notices.
I didn't think this mattered, but it does: if you add a dependency to
zenith_utils, but forget to request a feature you need, the crate will
build from the workspace root, but not by itself.
It's probably better to pull in the whole dependency tree.
This leaves one problem unsolved: the missing feature above will now be
a latent bug. If that feature gets removed later by other crates, and
then the workspace_hack Cargo.toml is updated, this missing feature will
become a build failure.
Add fixes suggested in code review.
In a previous commit, I changed the NodeId field order and types to try
to preserve the exact serialization that was happening. Unfortunately,
that serialization was incorrect and the original struct was mostly
correct.
Change uuid to be a [u8; 16] as it was intended to be a byte array; that
will clearly indicate to serde serializers that no endian swaps will
ever be needed.
Replace XLogRecPtr with Lsn in wal_service.rs .
This removes the last use of XLogSegmentOffset and XLByteToSeg, so
delete them. (replaced by Lsn::segment_offset and Lsn::segment_number.)
The C memory representation is only needed if we want to guarantee the
same memory layout as some other program. Since we're using serde to
serialize these data structures, we can let the compiler do what it
wants.
This version validates on every call that our result is exactly the same
as the previous result.
NodeId is a strange corner case: one field is serialized little-endian
and one field is serialized big-endian. Hopefully we can fix that in the
future.
This module adds two traits that implement bincode-based serialization.
BeSer implements methods for big-endian encoding/decoding.
LeSer implements methods for little-endian encoding/decoding.
Right now, the BeSer and LeSer methods have the same names, meaning you
can't `use` them both at the same time. This is intended to be a safety
mechanism: mixing big-endian and little-endian encoding in the same file
is error-prone. There are ways around this, but the easiest fix is to
put the big-endian code and little-endian code in different files or
submodules.
This is a big async -> sync conversion. Most of it is a pretty
straightforward conversion of removing `async` and `.await` and swapping
in the right std modules.
I didn't find a thread-blocking version of `Notify` so I wrote one, and
then realized that there was already a Mutex being used there, so I
deleted my Notify and just used Condvar instead.
There is one part that seems odd to me: in `handle_start_replication`
there is a place where the previous code was doing a non-blocking read;
there is no TcpStream::try_read() so I fell back on manually flipping
the socket to non-blocking mode and then back again. This seems pretty
gross, but I'm not sure exactly what to replace this with: a background
thread? Extract the fd and run select() on it to first test if it's
readable?
Switch over to a newer version of rust-postgres PR752. A few
minor changes are required:
- PgLsn::UNDEFINED -> PgLsn::from(0)
- PgTimestamp -> SystemTime
Our builds can be a little inconsistent, because Cargo doesn't deal well
with workspaces where there are multiple crates which have different
dependencies that select different features. As a workaround, copy what
other big rust projects do: add a workspace_hack crate.
This crate just pins down a set of dependencies and features that
satisfies all of the workspace crates.
The benefits are:
- running `cargo build` from one of the workspace subdirectories now
works without rebuilding anything.
- running `cargo install` works (without rebuilding anything).
- making small dependency changes is much less likely to trigger large
dependency rebuilds.
A few things that Eric commented on at PR #96:
- Use thiserror to simplify the implemention of FilePathError
- Add unit tests
- Fix a few complaints from clippy
We should track the range of LSNs that are valid in a GetPage@LSN request
somehow, but currently this is just dead code. Remove, until we get around
to actually implement it.
https://github.com/zenithdb/zenith/issues/95 tracks that.
Currently, truncation is implemented in the RocksDB repository by storing
a special sentinel entry for each page that was truncated away. Hide that
implementation detail better in the abstract Repository interface, so
that caller doesn't need to construct the special sentinel WAL record.
While we're at it, refactor the CacheEntryContent struct to an enum.
This moves things around:
- The PageCache is split into two structs: Repository and Timeline. A
Repository holds multiple Timelines. In order to get a page version,
you must first get a reference to the Repository, then the Timeline
in the repository, and finally call the get_page_at_lsn() function
on the Timeline object. This sounds complicated, but because each
connection from a compute node, and each WAL receiver, only deals
with one timeline at a time, the callers can get the reference to
the Timeline object once and hold onto it. The Timeline corresponds
most closely to the old PageCache object.
- Repository and Timeline are now abstract traits, so that we can
support multiple implementations. I don't actually expect us to have
multiple implementations for long. We have the RocksDB
implementation now, but as soon as we have a different
implementation that's usable, I expect that we will retire the
RocksDB implementation. But I think this abstraction works as good
documentation in any case: it's now easier to see what the interface
for storing and loading pages from the repository is, by looking at
the Repository/Timeline traits. They abstract traits are in
repository.rs, and the RocksDB implementation of them is in
repository/rocksdb.rs.
- page_cache.rs is now a "switchboard" to get a handle to the
repository. Currently, the page server can only handle one
repository at a time, so there isn't much there, but in the future
we might do multi-tenancy there.
Mostly we're not testing python code, so verbose python tracebacks are
unhelpful. Add --tb=short to the pytest args to cut down on the noise.
To override this during testing, set the "extra_params" parameter on the
circleci job to "--tb=auto" or "--tb=long".
The local fork of rust-s3 has some code to support Google Cloud, but
that PR no longer applies upstream, and will need significant changes
before it can be re-submitted.
In the meantime, we might as well just use the most similar upstream
release. The benefit of switching is that it fixes a feature-resolution
bug that was causing us to build 24 more crates than needed (mostly
async-std and its dependencies).
The rust cache is growing dramatically. Change the cache key to start
over.
The weird "v98" was something I'd intended to reset before landing the
circleci config. Do the sane thing and start over at v01. The intent is
that we just increment the number each time something gets broken.
Since we are now calling the syscall directly, read_pidfile can now
parse an integer.
We also verify the pid is >= 1, because calling kill on 0 or negative
values goes straight to crazytown.
default(medium): 2 CPUs, 4GB RAM.
xlarge: 8 CPUs, 16GB RAM.
Some build jobs are getting killed with signal 9. I'm guessing that this
is probably an OOM condition...
I found I had a few other .zenith directories hanging around in odd
places. I doubt we intended those directories to collect in multiple
locations, so only hide the one in the git root directory.
If there isn't any version specified for a dependency crate, Cargo may
choose a newer version. This could happen when Cargo.lock is updated
("cargo update") but can also happen unexpectedly when adding or
changing other dependencies. This can allow API-breaking changes to be
picked up, breaking the build.
To prevent this, specify versions for all dependencies. Cargo is still
allowed to pick newer versions that are (hopefully) non-breaking, by
analyzing the semver version number.
There are two special cases here:
1. serde_derive::{Serialize, Deserialize} isn't really used any more. It
was only a separate crate in the past because of compiler limitations.
Nowadays, people turn on the "derive" feature of the serde crate and
use serde::{Serialize, Deserialize}.
2. parse_duration is unmaintained and has an open security issue. (gh
iss. 87) That issue probably isn't critical for us because of where we
use that crate, but it's probably still better to pin the version so we
can't get hit with an API-breaking change at an awkward time.
It's quite hard to get python2 to exit gracefully when the code was
intended for python3, because the interpreter will SyntaxError before
running a single line of code. Thankfully, the pytest developers put a
version check in their .ini config, so that should gracefully handle
both wrong-pytest-version and wrong-python-version.
Also document the woes of trying to run the pytest version shipped by
e.g. Debian or Ubuntu.
Fetching the postgres submodule is one of the more expensive steps of
the build. Doing a shallow clone ("--depth 1") should save some time and
a lot of network bandwidth.
This does the postgres & rust builds, caching the results, and preserves
its outputs in a "workspace" for downstream test jobs (which can run in
parallel).
Pytest jobs are parameterized, so adding new pytest-based tests requires
only adding a new job to the "workflows" section at the end.
This could use some optimization:
- The "apt-get install" step is quite slow.
- The rust build step will always happen, even if only unrelated changes
are present (e.g. modified a python test file)
- Saving/restoring the rust cache (/target) is very slow (it contains
1.3GB of data)
- Saving the workspace is very slow.
- The "install" step is ugly; postgres and rust artifacts could take a
much better form.
Use pytest to manage background services, paths, and environment
variables.
Benefits:
- Tests are a little easier to write.
- Cleanup is more reliable. You can CTRL-C a test and it will still shut
down gracefully. If you manually start a conflicting process, the test
fixtures will detect this and abort at startup.
- Don't need to worry about remembering '--test-threads=1'
- Output of sub-processes can be captured to files.
- Test fixtures configure everything to operate under a single test
output directory, making it easier to capture logs in CI.
- Detects all the necessary paths if run from the git root, but can also
run from arbitrary paths by setting environment variables.
There is also a deliberately broken test (test_broken.py) that can be
used to test whether the test fixtures properly clean up after
themselves. It won't run by default; the comment at the top explains how
to enable it.
Remove the check that enforces running from the git root directory.
Discover the zenith binary path from current_exe().
Look for postgres in $POSTGRES_BIN or $CWD/tmp_install.
Previously, 'zenith init' would initialize a PostgreSQL cluster with
"initdb -D tmp", creating the temp cluster under current directory.
It moves the 'tmp' directory under the correct snapshot directory in
the zenith repository after that, but if something goes wrong in initdb,
or in the steps that follow, it could leave behind the 'tmp' directory
under current dir. Better to create the temporary directory under the
repository directory to begin with, as ".zenith/tmp".
- Move notifications to a separate job, run only on push.
- Build and test will execute on [pull_request, push].
- Use actions-rs/toolchain@v1 to get the rust toolchain.
- Add matrix hook to allow multiple toolchain versions in the future
(now set to [stable]).
- Run all the cargo tests, not just test_pageserver
Make the caller of request_redo() responsible for gathering the WAL records
to redo, and for storing the reconstructed page image back in the page
cache. This leaves the WAL redo manager purely responsible for dealing with
the postgres child process, removing its dependency on the PageCache.
Having multiple copies of the same values is a source of confusion.
Commit da9bf5dc63 fixed one race condition caused by that, for example.
See also discussion at
https://github.com/zenithdb/zenith/issues/57#issuecomment-824393470
This changes SeqWait.advance() to return the old number, and not panic if
you try to move the value backwards. The caller should check for that and
act accordingly.
Remove 'async' usage a much as feasible. Async code is harder to debug,
and mixing async and non-async code is a recipe for confusion and bugs.
There are a couple of exceptions:
- The code in walredo.rs, which needs to read and write to the child
process simultaneously, still uses async. It's more convenient there.
The 'async' usage is carefully limited to just the functions that
communicate with the child process.
- Code in walreceiver.rs that uses tokio-postgres to do streaming
replication. We have to use async there, because tokio-postgres is
async. Most rust-postgres functionality has non-async wrappers, but
not the new replication client code. The async usage is very limited
here, too: we use just block_on to call the tokio-postgres functions.
The code in 'page_service.rs' now launches a dedicated thread for each
connection.
This replaces tokio::sync:⌚:channel with std::sync:mpsc in
'seqwait.rs', to make that non-async. It's not a drop-in replacement,
though: std::sync::mpsc doesn't support multiple consumers, so we cannot
share a channel between multiple waiters. So this removes the code to
check if an existing channel can be reused, and creates a new one for
each waiter. That created another problem: BTreeMap cannot hold
duplicates, so I replaced that with BinaryHeap.
Similarly, the tokio::{mpsc, oneshot} channels used between WAL redo
manager and PageCache are replaced with std::sync::mpsc. (There is no
separate 'oneshot' channel in the standard library.)
Fixes github issue #58, and coincidentally also issue #66.
AtomicLsn is a wrapper around AtomicU64 that has load() and store()
members that are cheap (on x86, anyway) and can be safely used in any
context.
This commit uses AtomicLsn in the page cache, and fixes up some
downstream code that manually implemented LSN formatting.
There's also a bugfix to the logging in wait_lsn, which prints the
wrong lsn value.
SeqWait can use any type that is Ord + Debug + Copy. Debug is not
strictly necessary, but allows us to keep the panic message if a caller
wants the sequence number to go backwards.
This type is a zero-cost wrapper for a u64, meant to help code
communicate with precision what that value means.
It implements Display and Debug. Display "{}" will format as
"1234ABCD:5678CDEF" while Debug will format as Lsn{1234567890}.
It was only marked as async because it calls relsize_get(), but
relsize_get() will in fact never block when it's called with the max
LSN value, like put_wal_record() does. Refactor to avoid marking
put_wal_record() as 'async'.
After the rocksdb patch (commit 6aa38d3f7d), the CacheEntry struct was
used only momentarily in the communication between the page_cache and
the walredo modules. It was in fact not stored in any cache anymore.
For clarity, refactor the communication.
There is now a WalRedoManager struct, with `request_redo` function,
that can be used to request WAL replay of a particular page. It sends
a request to a queue like before, but the queue has been replaced with
tokio::sync::mpsc. Previously, the resulting page image was stored
directly in the CacheEntry, and the requestor was notified using a
condition variable. Now, the requestor includes a 'oneshot' channel in
the request, and the WAL redo manager sends the response there.
Clippy pointed out that `drop(waiters)` didn't do anything, because
there was a misplaced ";" causing `waiters` to be a unit type `()`.
This change makes it do what was intended: the lock should be dropped
first, then the wakeups should be processed.
The comment was incorrect, claiming that ZTimelineId is a 32-byte value.
It is actually 16 bytes wide. While we're at it, improve the comment,
explaining what a zenith timeline is, and why it's different from
PostgreSQL timelines.
When calling into the page cache, it was possible to wait on a blocking
mutex, which can stall the async executor.
Replace that sleep with a SeqWait::wait_for(lsn).await so that the
executor can go on with other work while we wait.
Change walreceiver_works to an AtomicBool to avoid the awkwardness of
taking the lock, then dropping it while we call wait_for and then
acquiring it again to do real work.
SeqWait adds a way to .await the arrival of some sequence number.
It provides wait_for(num) which is an async fn, and advance(num) which
is synchronous.
This should be useful in solving the page cache deadlocks, and may be
useful in other areas too.
This implementation still uses a Mutex internally, but only for a brief
critical section. If we find this code broadly useful and start to care
more about executor stalls due to unfair thread scheduling, there might
be ways to make it lock-free.
- remove needless return
- remove needless format!
- remove a few more needless clone()
- from_str_radix(_, 10) -> .parse()
- remove needless reference
- remove needless `mut`
Also manually replaced a match statement with map_err() because after
clippy was done with it, there was almost nothing left in the match
expression.
Replay XLOG_XACT_COMMIT and XLOG_XACT_ABORT records in walredo.
Don't wait for lsn catchup before walreceiver connected.
Use 'request_nonrel' branch of vendor/postgres
The postgres_ext.h isn't found in vendor/postgres, if the Postgres
was restored from cache instead of building it in vendor/postgres.
To fix, change include path to point into tmp_install/include where the
headers are installed, instead of the vendor/postgres source dir.
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.
There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.
The repository uses the directory layout desribed as Repository format
v1 in https://github.com/zenithdb/rfcs/pull/5. It it *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
backup of the initial empty cluster, and everything else is reconstructed
by redoing all WAL
- when you create a new timeline, the base snapshot and *all* WAL is copied
from the new timeline to the new one. There is no smarts about
referencing the old snapshots/wal from the ancestor timeline.
To support all this, this commit includes a bunch of other changes:
- Implement "basebackup" funtionality in page server. When you initialize
a new compute node with "zenith pg create", it connects to the page
server, and requests a base backup of the Postgres data directory on
that timeline. (the base backup excludes user tables, so it's not
as bad as it sounds).
- Have page server's WAL receiver write the WAL into timeline dir. This
allows running a Page Server and Compute Nodes without a WAL safekeeper,
until we get around to integrate that properly into the system. (Even
after we integrate WAL safekeeper, this is perhaps how this will operate
when you want to run the system on your laptop.)
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
modified to use the new format. It now also restores all WAL.
- Page server no longer scans and restores everything into memory at startup.
Instead, when the first request is made for a timeline, the timeline is
slurped into memory at that point.
- The responsibility for telling page server to "callmemaybe" was moved
into Postgres libpqpagestore code. Also, WAL producer connstring cannot
be specified in the pageserver's command line anymore.
- Having multiple "system identifiers" in the same page server is no
longer supported. I repurposed much of that code to support multiple
timelines, instead.
- Implemented very basic, incomplete, support for PostgreSQL's Extended
Query Protocol in page_service.rs. Turns out that rust-postgres'
copy_out() function always uses the extended query protocol to send
out the command, and I'm using that to stream the base backup from the
page server.
TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
The PostgreSQL FE/BE RowDescription message was built incorrectly,
the colums were sent in wrong order, and the command tag was missing
NULL-terminator. With these fixes, 'psql' understands the reply and
shows it correctly.
For some reason printing the Result made the error string print twice,
along with some annoying newlines. Extracting the error first gets the
expected result (just one explanation, no newlines)
Just a few more places where we can drop the .unwrap() call in favor of
`?`.
Also include a fix to the log file handling: don't open the file twice.
Writing to two fds would result in one message overwriting another.
Presumably `File.try_clone()` reduces down to `dup` on Linux.
This is a first attempt at a new error-handling strategy:
- Use anyhow::Error as the first choice for easy error handling
- Use thiserror to generate local error types for anything that
needs it (no error type is available to us) or will be inspected
or matched on by higher layers.
Resolve some basic warnings from clippy:
- useless conversion to the same type
- redundant field names in struct initialization
- redundant single-component path imports
We don't need the nightly compiler, but there's no reason it shouldn't
compile without warnings, either. I don't know why stable doesn't warn
about these, but it's cheap to suppress them.
Using the hash should allow us to change the remote repo and propagate
that change to user builds without that change becoming visible at a
random time.
It's unfortunate that we can't declare this dependency once in the
top-level Cargo.toml; that feature request is rust-lang rfc 2906.
When postgres sends us a keepalive message, send a reply so it doesn't
time out and close the connection.
The LSN values sent may need to change in the future. Currently we send:
write_lsn <= page_cache last_valid_lsn
flush_lsn <= page_cache last_valid_lsn
apply_lsn <= 0
print file errors to stderr; propagate the io::Error to the caller.
This error isn't handled very gracefully in WalAcceptorNode::drop(),
but there aren't any good options there since drop can't fail.
# This file is only read when `yapf` is run from this directory.
# Hence we only top-level directories here to avoid confusion.
# See source code for the exact file format: https://github.com/google/yapf/blob/c6077954245bc3add82dafd853a1c7305a6ebd20/yapf/yapflib/file_resources.py#L40-L43
Zenith substitutes PostgreSQL storage layer and redistributes data across a cluster of nodes
Zenith is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
## Architecture overview
A Zenith installation consists of compute nodes and Zenith storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage engine.
Zenith storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
- WAL service. The service that receives WAL from compute node and ensures that it is stored durably.
Pageserver consists of:
- Repository - Zenith storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
## Running local installation
1. Install build dependencies and other useful packages
On Ubuntu or Debian this set of packages should be sufficient to build the code:
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
make# builds also postgres and installs it to ./tmp_install
./scripts/pytest
```
## Source tree layout
## Documentation
/walkeeper:
Now we use README files to cover design ideas and overall architecture for each module and `rustdoc` style documentation comments. See also [/docs/](/docs/) a top-level overview of all available markdown documentation.
WAL safekeeper. Written in Rust.
- [/docs/sourcetree.md](/docs/sourcetree.md) contains overview of source tree layout.
/pageserver:
To view your `rustdoc` documentation in a browser, try running `cargo doc --no-deps --open`
Page Server. Written in Rust.
### Postgres-specific terms
Depends on the modified 'postgres' binary for WAL redo.
Due to Zenith's very close relation with PostgreSQL internals, there are numerous specific terms used.
Same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use.
/integration_tests:
Tests with different combinations of a Postgres compute node, WAL safekeeper and Page Server.
/mgmt-console:
Web UI to launch (modified) Postgres servers, using S3 as the backing store. Written in Python.
This is somewhat outdated, as it doesn't use the WAL safekeeper or Page Servers.
/vendor/postgres:
PostgreSQL source tree, with the modifications needed for Zenith.
/vendor/postgres/src/bin/safekeeper:
Extension (safekeeper_proxy) that runs in the compute node, and connects to the WAL safekeepers
~/git-sandbox/zenith (cli-v2)$ ./target/debug/cli start main
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:27:43.919 EEST [984664] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:27:43.920 EEST [984664] LOG: listening on IPv6 address "::1", port 5432
2021-04-13 09:27:43.920 EEST [984664] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-04-13 09:27:43.927 EEST [984664] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-13 09:27:43.939 EEST [984665] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:27:43.939 EEST [984665] LOG: creating missing WAL directory "pg_wal/archive_status"
2021-04-13 09:27:44.189 EEST [984665] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:27:44.195 EEST [984665] LOG: invalid record length at 0/15FFB80: wanted 24, got 0
2021-04-13 09:27:44.195 EEST [984665] LOG: redo is not required
2021-04-13 09:27:44.225 EEST [984664] LOG: database system is ready to accept connections
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:28:41.874 EEST [984766] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:28:41.875 EEST [984766] LOG: listening on IPv6 address "::1", port 5433
2021-04-13 09:28:41.875 EEST [984766] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-04-13 09:28:41.883 EEST [984766] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2021-04-13 09:28:41.896 EEST [984767] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:28:42.265 EEST [984767] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:28:42.269 EEST [984767] LOG: redo starts at 0/15FFB80
2021-04-13 09:28:42.272 EEST [984767] LOG: invalid record length at 0/161F4B0: wanted 24, got 0
2021-04-13 09:28:42.272 EEST [984767] LOG: redo done at 0/161F478 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-04-13 09:28:42.321 EEST [984766] LOG: database system is ready to accept connections
done
server started
Insert some a row on the 'experimental' branch:
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
t
-----------------------------
inserted on the main branch
(1 row)
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "insert into foo values ('inserted on experimental')"
INSERT 0 1
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
t
-----------------------------
inserted on the main branch
inserted on experimental
(2 rows)
See that the other Postgres instance is still running on 'main' branch on port 5432:
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5432 -c "select * from foo"
t
-----------------------------
inserted on the main branch
(1 row)
Everything is stored in the .zenith directory:
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/
total 12
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:28 datadirs
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:27 refs
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:28 timelines
The 'datadirs' directory contains the datadirs of the running instances:
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/datadirs/
total 8
drwx------ 18 heikki heikki 4096 Apr 13 09:27 3c0c634c1674079b2c6d4edf7c91523e
drwx------ 18 heikki heikki 4096 Apr 13 09:28 697e3c103d4b1763cd6e82e4ff361d76
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/datadirs/3c0c634c1674079b2c6d4edf7c91523e/
total 124
drwxr-xr-x 5 heikki heikki 4096 Apr 13 09:27 base
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 global
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_commit_ts
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_dynshmem
-rw------- 1 heikki heikki 4760 Apr 13 09:27 pg_hba.conf
-rw------- 1 heikki heikki 1636 Apr 13 09:27 pg_ident.conf
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:32 pg_logical
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:27 pg_multixact
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_notify
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_replslot
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_serial
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_snapshots
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_stat
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:34 pg_stat_tmp
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_subtrans
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_tblspc
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_twophase
-rw------- 1 heikki heikki 3 Apr 13 09:27 PG_VERSION
lrwxrwxrwx 1 heikki heikki 52 Apr 13 09:27 pg_wal -> ../../timelines/3c0c634c1674079b2c6d4edf7c91523e/wal
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_xact
-rw------- 1 heikki heikki 88 Apr 13 09:27 postgresql.auto.conf
-rw------- 1 heikki heikki 28688 Apr 13 09:27 postgresql.conf
-rw------- 1 heikki heikki 96 Apr 13 09:27 postmaster.opts
-rw------- 1 heikki heikki 149 Apr 13 09:27 postmaster.pid
Note how 'pg_wal' is just a symlink to the 'timelines' directory. The
datadir is ephemeral, you can delete it at any time, and it can be reconstructed
from the snapshots and WAL stored in the 'timelines' directory. So if you push/pull
the repository, the 'datadirs' are not included. (They are like git working trees)
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:37:05.476 EEST [985340] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:37:05.477 EEST [985340] LOG: listening on IPv6 address "::1", port 5433
2021-04-13 09:37:05.477 EEST [985340] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-04-13 09:37:05.487 EEST [985340] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2021-04-13 09:37:05.498 EEST [985341] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:37:05.808 EEST [985341] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:37:05.813 EEST [985341] LOG: redo starts at 0/15FFB80
2021-04-13 09:37:05.815 EEST [985341] LOG: invalid record length at 0/161F770: wanted 24, got 0
2021-04-13 09:37:05.815 EEST [985341] LOG: redo done at 0/161F738 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-04-13 09:37:05.866 EEST [985340] LOG: database system is ready to accept connections
done
server started
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
Current state of authentication includes usage of JWT tokens in communication between compute and pageserver and between CLI and pageserver. JWT token is signed using RSA keys. CLI generates a key pair during call to `zenith init`. Using following openssl commands:
CLI also generates signed token and saves it in the config for later access to pageserver. Now authentication is optional. Pageserver has two variables in config: `auth_validation_public_key_path` and `auth_type`, so when auth type present and set to `ZenithJWT` pageserver will require authentication for connections. Actual JWT is passed in password field of connection string. There is a caveat for psql, it silently truncates passwords to 100 symbols, so to correctly pass JWT via psql you have to either use PGPASSWORD environment variable, or store password in psql config file.
Currently there is no authentication between compute and safekeepers, because this communication layer is under heavy refactoring. After this refactoring support for authentication will be added there too. Now safekeeper supports "hardcoded" token passed via environment variable to be able to use callmemaybe command in pageserver.
Compute uses token passed via environment variable to communicate to pageserver and in the future to the safekeeper too.
JWT authentication now supports two scopes: tenant and pageserverapi. Tenant scope is intended for use in tenant related api calls, e.g. create_branch. Compute launched for particular tenant also uses this scope. Scope pageserver api is intended to be used by console to manage pageserver. For now we have only one management operation - create tenant.
The cmin/cmax on a heap page is a real bummer. I don't see any other way to fix that than bite the bullet and modify the WAL-logging routine to include the cmin/cmax.
To recap, the problem is that the XLOG_HEAP_INSERT record does not include the command id of the inserted row. And same with deletion/update. So in the primary, a row is inserted with current xmin + cmin. But in the replica, the cmin is always set to 1. That works, because the command id is only relevant to the inserting transaction itself. After commit/abort, no one cares abut it anymore.
- Alternatives?
I don't know
2. Add PD_WAL_LOGGED.
- Why?
Postgres sometimes writes data to the page before it is wal-logged. If such page ais swapped out, we will loose this change. The problem is currently solved by setting PD_WAL_LOGGED bit in page header. When page without this bit set is written to the SMGR, then it is forced to be written to the WAL as FPI using log_newpage_copy() function.
There was wrong assumption that it can happen only during construction of some exotic indexes (like gist). It is not true. The same situation can happen with COPY,VACUUM and when record hint bits are set.
Do not store this flag in page header, but associate this bit with shared buffer. Logically it is more correct but in practice we will get not advantages: neither in space, neither in CPU overhead.
3. XLogReadBufferForRedo not always loads and pins requested buffer. So we need to add extra checks that buffer is really pinned. Also do not use BufferGetBlockNumber for buffer returned by XLogReadBufferForRedo.
- Why?
XLogReadBufferForRedo is not pinning pages which are not requested by wal-redo. It is specific only for wal-redo Postgres.
- Alternatives?
No
4. Eliminate reporting of some warnings related with hint bits, for example
"page is not marked all-visible but visibility map bit is set in relation".
- Why?
Hint bit may be not WAL logged.
- Alternative?
Always wal log any page changes.
5. Maintain last written LSN.
- Why?
When compute node requests page from page server, we need to specify LSN. Ideally it should be LSN
of WAL record performing last update of this pages. But we do not know it, because we do not have page.
We can use current WAL flush position, but in this case there is high probability that page server
will be blocked until this peace of WAL is delivered.
As better approximation we can keep max LSN of written page. It will be better to take in account LSNs only of evicted pages,
but SMGR API doesn't provide such knowledge.
- Alternatives?
Maintain map of LSNs of evicted pages.
6. Launching Postgres without WAL.
- Why?
According to Zenith architecture compute node is stateless. So when we are launching
compute node, we need to provide some dummy PG_DATADIR. Relation pages
can be requested on demand from page server. But Postgres still need some non-relational data:
control and configuration files, SLRUs,...
It is currently implemented using basebackup (do not mix with pg_basebackup) which is created
by pageserver. It includes in this tarball config/control files, SLRUs and required directories.
As far as pageserver do not have original (non-scattered) WAL segments, it includes in
this tarball dummy WAL segment which contains only SHUTDOWN_CHECKPOINT record at the beginning of segment,
which redo field points to the end of wal. It allows to load checkpoint record in more or less
standard way with minimal changes of Postgres, but then some special handling is needed,
including restoring previous record position from zenith.signal file.
Also we have to correctly initialize header of last WAL page (pointed by checkpoint.redo)
to pass checks performed by XLogReader.
- Alternatives?
We may not include fake WAL segment in tarball at all and modify xlog.c to load checkpoint record
in special way. But it may only increase number of changes in xlog.c
7. Add redo_read_buffer_filter callback to XLogReadBufferForRedoExtended
- Why?
We need a way in wal-redo Postgres to ignore pages which are not requested by pageserver.
So wal-redo Postgres reconstructs only requested page and for all other returns BLK_DONE
which means that recovery for them is not needed.
- Alternatives?
No
8. Enforce WAL logging of sequence updates.
- Why?
Due to performance reasons Postgres don't want to log each fetching of a value from a sequence,
so we pre-log a few fetches in advance. In the event of crash we can lose
(skip over) as many values as we pre-logged.
But it doesn't work with Zenith because page with sequence value can be evicted from buffer cache
and we will get a gap in sequence values even without crash.
- Alternatives:
Do not try to preserve sequential order but avoid performance penalty.
9. Treat unlogged tables as normal (permanent) tables.
- Why?
Unlogged tables are not transient, so them have to survive node restart (unlike temporary tables).
But as far as compute node is stateless, we need to persist their data to storage node.
And it can only be done through the WAL.
- Alternatives?
* Store unlogged tables locally (violates requirement of stateless compute nodes).
* Prohibit unlogged tables at all.
10. Support start Postgres in wal-redo mode
- Why?
To be able to apply WAL record and reconstruct pages at page server.
- Alternatives?
* Rewrite redo handlers in Rust
* Do not reconstruct pages at page server at all and do it at compute node.
11. WAL proposer
- Why?
WAL proposer is communicating with safekeeper and ensures WAL durability by quorum writes.
It is currently implemented as patch to standard WAL sender.
- Alternatives?
Can be moved to extension if some extra callbacks will be added to wal sender code.
12. Secure Computing BPF API wrapper.
- Why?
Pageserver delegates complex WAL decoding duties to Postgres,
which means that the latter might fall victim to carefully designed
malicious WAL records and start doing harmful things to the system.
To prevent this, it has been decided to limit possible interactions
with the outside world using the Secure Computing BPF mode.
- Alternatives:
* Rewrite redo handlers in Rust.
* Add more checks to guarantee correctness of WAL records.
* Move seccomp.c to extension
* Many other discussed approaches to neutralize incorrect WAL records vulnerabilities.
13. Callbacks for replica feedbacks
- Why?
Allowing waproposer to interact with walsender code.
- Alternatives
Copy walsender code to walproposer.
14. Support multiple SMGR implementations.
- Why?
Postgres provides abstract API for storage manager but it has only one implementation
and provides no way to replace it with custom storage manager.
- Alternatives?
None.
15. Calculate database size as sum of all database relations.
- Why?
Postgres is calculating database size by traversing data directory
but as far as Zenith compute node is stateless we can not do it.
- Alternatives?
Send this request directly to pageserver and calculate real (physical) size
of Zenith representation of database/timeline, rather than sum logical size of all relations.
-----------------------------------------------
Not currently committed but proposed:
1. Disable ring buffer buffer manager strategies
- Why?
Postgres tries to avoid cache flushing by bulk operations (copy, seqscan, vacuum,...).
Even if there are free space in buffer cache, pages may be evicted.
Negative effect of it can be somehow compensated by file system cache, but in case of Zenith
cost of requesting page from page server is much higher.
- Alternatives?
Instead of just prohibiting ring buffer we may try to implement more flexible eviction policy,
for example copy evicted page from ring buffer to some other buffer if there is free space
in buffer cache.
2. Disable marking page as dirty when hint bits are set.
- Why?
Postgres has to modify page twice: first time when some tuple is updated and second time when
hint bits are set. Wal logging hint bits updates requires FPI which significantly increase size of WAL.
- Alternatives?
Add special WAL record for setting page hints.
3. Prefetching
- Why?
As far as pages in Zenith are loaded on demand, to reduce node startup time
and also sppedup some massive queries we need some mechanism for bulk loading to
reduce page request round-trip overhead.
Currently Postgres is supporting prefetching only for bitmap scan.
In Zenith we also use prefetch for sequential and index scan. For sequential scan we prefetch
some number of following pages. For index scan we prefetch pages of heap relation addressed by TIDs.
4. Prewarming.
- Why?
Short downtime (or, in other words, fast compute node restart time) is one of the key feature of Zenith.
But overhead of request-response round-trip for loading pages on demand can make started node warm-up quite slow.
We can capture state of compute node buffer cache and send bulk request for this pages at startup.
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
And two intermediate images used either to reduce build time or to deliver some additional binary tools from other repos:
- [zenithdb/build](https://hub.docker.com/repository/docker/zenithdb/build) — image with all the dependencies required to build Zenith and compute node images. This image is based on `rust:slim-buster`, so it also has a proper `rust` environment. Built from [/Dockerfile.build](/Dockerfile.build).
1. Image `zenithdb/compute-tools` is re-built automatically.
2. Image `zenithdb/build` is built manually. If you want to introduce any new compile time dependencies to Zenith or compute node you have to update this image as well, build it and push to Docker Hub.
Backpressure is used to limit the lag between pageserver and compute node or WAL service.
If compute node or WAL service run far ahead of Page Server,
the time of serving page requests increases. This may lead to timeout errors.
To tune backpressure limits use `max_replication_write_lag`, `max_replication_flush_lag` and `max_replication_apply_lag` settings.
When lag between current LSN (pg_current_wal_flush_lsn() at compute node) and minimal write/flush/apply position of replica exceeds the limit
backends performing writes are blocked until the replica is caught up.
### Base image (page image)
### Basebackup
A tarball with files needed to bootstrap a compute node[] and a corresponding command to create it.
NOTE:It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create branch at certain LSN using `zenith branch` command.
Each Branch lives in a corresponding timeline[] and has an ancestor[].
### Checkpoint (PostgreSQL)
NOTE: This is an overloaded term.
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
### Checkpoint (Layered repository)
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server []
writes out the changes from in-memory layers into new layer files[]. This process
is called "checkpointing". The page server only creates layer files for
relations that have been modified since the last checkpoint.
Configuration parameter `checkpoint_distance` defines the distance
from current LSN to perform checkpoint of in-memory layers.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
Set this parameter to `0` to force checkpoint of every layer.
Configuration parameter `checkpoint_period` defines the interval between checkpoint iterations.
Default is `DEFAULT_CHECKPOINT_PERIOD`.
### Compute node
Stateless Postgres node that stores data in pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
Each PostgreSQL fork is considered a separate relish.
### Layer
A layer contains data needed to reconstruct any page versions within the
layer's Segment and range of LSNs.
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". Each file corresponds to one RELISH_SEG_SIZE
segment of a PostgreSQL relation fork. There are two kinds of layer
files: image files and delta files. An image file contains a
"snapshot" of the segment at a particular LSN, and a delta file
contains WAL records applicable to the segment, in a range of LSNs.
### Layer map
The layer map tracks what layers exist for all the relishes in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
### LSN
The Log Sequence Number (LSN) is a unique identifier of the WAL record[] in the WAL log.
The insert position is a byte offset into the logs, increasing monotonically with each new record.
Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream.
It is printed as two hexadecimal numbers of up to 8 digits each, separated by a slash.
Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html)
Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
In postgres and Zenith lsns are used to describe certain points in WAL handling.
PostgreSQL LSNs and functions to monitor them:
*`pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location.
*`pg_current_wal_lsn()` - Returns the current write-ahead log write location.
*`pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
*`pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
*`pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
The Page Service listens for GetPage@LSN requests from the Compute Nodes,
and responds with pages from the repository.
### PITR (Point-in-time-recovery)
PostgreSQL's ability to restore up to a specified LSN.
### Primary node
### Proxy
Postgres protocol proxy/router.
This service listens psql port, can check auth via external service
and create new databases and accounts (control plane API in our case).
### Relation
The generic term in PostgreSQL for all objects in a database that have a name and a list of attributes defined in a specific order.
### Relish
We call each relation and other file that is stored in the
repository a "relish". It comes from "rel"-ish, as in "kind of a
rel", because it covers relations as well as other things that are
not relations, but are treated similarly for the purposes of the
storage layer.
### Replication slot
### Replica node
### Repository
Repository stores multiple timelines, forked off from the same initial call to 'initdb'
and has associated WAL redo service.
One repository corresponds to one Tenant.
### Retention policy
How much history do we need to keep around for PITR and read-only nodes?
### Segment (PostgreSQL)
NOTE: This is an overloaded term.
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### Segment (Layered Repository)
NOTE: This is an overloaded term.
Segment is a RELISH_SEG_SIZE slice of relish (identified by a SegmentTag).
### SLRU
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in zenith yet (pg_commit_ts).
Each SLRU segment is considered a separate relish[].
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
Wal redo[] activity, timelines[], layers[] are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper
See `docs/multitenancy.md` for more.
### Timeline
Timeline accepts page changes and serves get_page_at_lsn() and
get_rel_size() requests. The term "timeline" is used internally
in the system, but to users they are exposed as "branches", with
human-friendly names.
NOTE: this has nothing to do with PostgreSQL WAL timelines.
### XLOG
PostgreSQL alias for WAL[].
### WAL (Write-ahead log)
The journal that keeps track of the changes in the database cluster as user- and system-invoked operations take place. It comprises many individual WAL records[] written sequentially to WAL files[].
### WAL acceptor, WAL proposer
In the context of the consensus algorithm, the Postgres
compute node is also known as the WAL proposer, and the safekeeper is also known
as the acceptor. Those are the standard terms in the Paxos algorithm.
### WAL receiver (WAL decoder)
The WAL receiver connects to the external WAL safekeeping service (or
directly to the primary) using PostgreSQL physical streaming
replication, and continuously receives WAL. It decodes the WAL records,
and stores them to the repository.
We keep one WAL receiver active per timeline.
### WAL record
A low-level description of an individual data change.
### WAL redo
A service that runs PostgreSQL in a special wal_redo mode
to apply given WAL records over an old page image and return new page image.
### WAL safekeeper
One node that participates in the quorum. All the safekeepers
together form the WAL service.
### WAL segment (WAL file)
Also known as WAL segment or WAL segment file. Each of the sequentially-numbered files that provide storage space for WAL. The files are all of the same predefined size and are written in sequential order, interspersing changes as they occur in multiple simultaneous sessions.
### WAL service
The service as whole that ensures that WAL is stored durably.
Zenith supports multitenancy. One pageserver can serve multiple tenants at once. Tenants can be managed via zenith CLI. During page server setup tenant can be created using ```zenith init --create-tenant``` Also tenants can be added into the system on the fly without pageserver restart. This can be done using the following cli command: ```zenith tenant create``` Tenants use random identifiers which can be represented as a 32 symbols hexadecimal string. So zenith tenant create accepts desired tenant id as an optional argument. The concept of timelines/branches is working independently per tenant.
### Tenants in other commands
By default during `zenith init` new tenant is created on the pageserver. Newly created tenant's id is saved to cli config, so other commands can use it automatically if no direct arugment `--tenantid=<tenantid>` is provided. So generally tenantid more frequently appears in internal pageserver interface. Its commands take tenantid argument to distinguish to which tenant operation should be applied. CLI support creation of new tenants.
On the page server tenants introduce one level of indirection, so data directory structured the following way:
```
<pageserver working directory>
├── pageserver.log
├── pageserver.pid
├── pageserver.toml
└── tenants
├── 537cffa58a4fa557e49e19951b5a9d6b
├── de182bc61fb11a5a6b390a8aed3a804a
└── ee6016ec31116c1b7c33dfdfca38891f
```
Wal redo activity and timelines are managed for each tenant independently.
For local environment used for example in tests there also new level of indirection for tenants. It touches `pgdatadirs` directory. Now it contains `tenants` subdirectory so the structure looks the following way:
```
pgdatadirs
└── tenants
├── de182bc61fb11a5a6b390a8aed3a804a
│ └── main
└── ee6016ec31116c1b7c33dfdfca38892f
└── main
```
### Changes to postgres
Tenant id is passed to postgres via GUC the same way as the timeline. Tenant id is added to commands issued to pageserver, namely: pagestream, callmemaybe. Tenant id is also exists in ServerInfo structure, this is needed to pass the value to wal receiver to be able to forward it to the pageserver.
### Safety
For now particular tenant can only appear on a particular pageserver. Set of safekeepers are also pinned to particular (tenantid, timeline) pair so there can only be one writer for particular (tenantid, timeline).
This feature allows to migrate a timeline from one pageserver to another by utilizing remote storage capability.
### Migration process
Pageserver implements two new http handlers: timeline attach and timeline detach.
Timeline migration is performed in a following way:
1. Timeline attach is called on a target pageserver. This asks pageserver to download latest checkpoint uploaded to s3.
2. For now it is necessary to manually initialize replication stream via callmemaybe call so target pageserver initializes replication from safekeeper (it is desired to avoid this and initialize replication directly in attach handler, but this requires some refactoring (probably [#997](https://github.com/zenithdb/zenith/issues/997)/[#1049](https://github.com/zenithdb/zenith/issues/1049))
3. Replication state can be tracked via timeline detail pageserver call.
4. Compute node should be restarted with new pageserver connection string. Issue with multiple compute nodes for one timeline is handled on the safekeeper consensus level. So this is not a problem here.Currently responsibility for rescheduling the compute with updated config lies on external coordinator (console).
5. Timeline is detached from old pageserver. On disk data is removed.
### Implementation details
Now safekeeper needs to track which pageserver it is replicating to. This introduces complications into replication code:
* We need to distinguish different pageservers (now this is done by connection string which is imperfect and is covered here: https://github.com/zenithdb/zenith/issues/1105). Callmemaybe subscription management also needs to track that (this is already implemented).
* We need to track which pageserver is the primary. This is needed to avoid reconnections to non primary pageservers. Because we shouldn't reconnect to them when they decide to stop their walreceiver. I e this can appear when there is a load on the compute and we are trying to detach timeline from old pageserver. In this case callmemaybe will try to reconnect to it because replication termination condition is not met (page server with active compute could never catch up to the latest lsn, so there is always some wal tail)
Pageserver is mainly configured via a `pageserver.toml` config file.
If there's no such file during `init` phase of the server, it creates the file itself. Without 'init', the file is read.
There's a possibility to pass an arbitrary config value to the pageserver binary as an argument: such values override
the values in the config file, if any are specified for the same key and get into the final config during init phase.
### Config example
```toml
# Initial configuration file created by 'pageserver --init'
listen_pg_addr='127.0.0.1:64000'
listen_http_addr='127.0.0.1:9898'
checkpoint_distance='268435456'# in bytes
checkpoint_period='1 s'
gc_period='100 s'
gc_horizon='67108864'
max_file_descriptors='100'
# initial superuser role name to use when creating a new tenant
initial_superuser_name='zenith_admin'
# [remote_storage]
```
The config above shows default values for all basic pageserver settings.
Pageserver uses default values for all files that are missing in the config, so it's not a hard error to leave the config blank.
Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start.
Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in TOML specification and
* either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
* or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
### Config values
All values can be passed as an argument to the pageserver binary, using the `-c` parameter and specified as a valid TOML string. All tables should be passed in the inline form.
Below you will find a brief overview of each subdir in the source tree in alphabetical order.
`/control_plane`:
Local control plane.
Functions to start, configure and stop pageserver and postgres instances running as a local processes.
Intended to be used in integration tests and in CLI tools for local installations.
`/docs`:
Documentaion of the Zenith features and concepts.
Now it is mostly dev documentation.
`/monitoring`:
TODO
`/pageserver`:
Zenith storage service.
The pageserver has a few different duties:
- Store and manage the data.
- Generate a tarball with files needed to bootstrap ComputeNode.
- Respond to GetPage@LSN requests from the Compute Nodes.
- Receive WAL from the WAL service and decode it.
- Replay WAL that's applicable to the chunks that the Page Server maintains
For more detailed info, see `/pageserver/README`
`/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
`/proxy`:
Postgres protocol proxy/router.
This service listens psql port, can check auth via external service
and create new databases and accounts (control plane API in our case).
`/test_runner`:
Integration tests, written in Python using the `pytest` framework.
`/vendor/postgres`:
PostgreSQL source tree, with the modifications needed for Zenith.
`/vendor/postgres/contrib/zenith`:
PostgreSQL extension that implements storage manager API and network communications with remote page server.
`/vendor/postgres/contrib/zenith_test_utils`:
PostgreSQL extension that contains functions needed for testing and debugging.
`/walkeeper`:
The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see `/walkeeper/README`
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
`/zenith`
Main entry point for the 'zenith' CLI utility.
TODO: Doesn't it belong to control_plane?
`/zenith_metrics`:
Helpers for exposing Prometheus metrics from the server.
`/zenith_utils`:
Helpers that are shared between other crates in this repository.
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version) or greater.
- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected.
- If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
```
- Install `poetry`
- Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`.
- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI.
Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.
### Obligatory checks
We force code formatting via `yapf` and type hints via `mypy`.
Run the following commands in the repository's root (next to `setup.cfg`):
```bash
poetry run yapf -ri . # All code is reformatted
poetry run mypy . # Ensure there are no typing errors
```
**WARNING**: do not run `mypy` from a directory other than the root of the repository.
Otherwise it will not find its configuration.
Also consider:
* Running `flake8` (or a linter of your choice, e.g. `pycodestyle`) and fixing possible defects, if any.
* Adding more type hints to your code to avoid `Any`.
### Changing dependencies
To add new package or change an existing one you can use `poetry add` or `poetry update` or edit `pyproject.toml` manually. Do not forget to run `poetry lock` in the latter case.
More details are available in poetry's [documentation](https://python-poetry.org/docs/).
InZenith,snapshotsarejustspecificpoints(LSNs)intheWALhistory,withalabel.Asnapshotpreventsgarbagecollectingolddatathat's still needed to reconstruct the database at that LSN.
</p>
<p className="todo">
TODO:
<ul>
<li>List existing snapshots</li>
<li>Create new snapshot manually, from current state or from a given LSN</li>
<li>Drill into the WAL stream to see what have happened. Provide tools for e.g. finding point where a table was dropped</li>
<li>Create snapshots automatically based on events in the WAL, like if you call pg_create_restore_point(() in the primary</li>
<li>Launch new reader instance at a snapshot</li>
<li>Export snapshot</li>
<li>Rollback cluster to a snapshot</li>
</ul>
</p>
</>
);
} else if (page === 'demo') {
return (
<>
<h1>Misc actions</h1>
<ActionButtons startOperation={ startOperation }
bucketSummary={ bucketSummary }/>
</>
);
} else if (page === 'import') {
return (
<>
<h1>Import & Export tools</h1>
<p className="TODO">TODO:
<ul>
<li>Initialize database from existing backup (pg_basebackup, WAL-G, pgbackrest)</li>
<li>Initialize from a pg_dump or other SQL script</li>
<li>Launch batch job to import data files from S3</li>
<li>Launch batch job to export database with pg_dump to S3</li>
</ul>
These jobs can be run in against reader processing nodes. We can even
spawn a new reader node dedicated to a job, and destry it when the job is done.
</p>
</>
);
} else if (page === 'jobs') {
return (
<>
<h1>Batch jobs</h1>
<p className="TODO">TODO:
<ul>
<li>List running jobs launched from Import & Export tools</li>
<li>List other batch jobs launched by the user</li>
Repository is an abstract trait, defined in `repository.rs`. It is
implemented by the LayeredRepository object in
`layered_repository.rs`. There is only that one implementation of the
Repository trait, but it's still a useful abstraction that keeps the
interface for the low-level storage functionality clean. The layered
storage format is described in layered_repository/README.md.
Each repository consists of multiple Timelines. Timeline is a
workhorse that accepts page changes from the WAL, and serves
get_page_at_lsn() and get_rel_size() requests. Note: this has nothing
to do with PostgreSQL WAL timeline. The term "timeline" is mostly
interchangeable with "branch", there is a one-to-one mapping from
branch to timeline. A timeline has a unique ID within the tenant,
represented as 16-byte hex string that never changes, whereas a
branch is a user-given name for a timeline.
Each repository also has a WAL redo manager associated with it, see
`walredo.rs`. The WAL redo manager is used to replay PostgreSQL WAL
records, whenever we need to reconstruct a page version from WAL to
satisfy a GetPage@LSN request, or to avoid accumulating too much WAL
for a page. The WAL redo manager uses a Postgres process running in
special zenith wal-redo mode to do the actual WAL redo, and
communicates with the process using a pipe.
Checkpointing / Garbage Collection
----------------------------------
Periodically, the checkpointer thread wakes up and performs housekeeping
duties on the repository. It has two duties:
### Checkpointing
Flush WAL that has accumulated in memory to disk, so that the old WAL
can be truncated away in the WAL safekeepers. Also, to free up memory
for receiving new WAL. This process is called "checkpointing". It's
similar to checkpointing in PostgreSQL or other DBMSs, but in the page
server, checkpointing happens on a per-segment basis.
### Garbage collection
Remove old on-disk layer files that are no longer needed according to the
PITR retention policy
### Backup service
The backup service, responsible for storing pageserver recovery data externally.
Currently, pageserver stores its files in a filesystem directory it's pointed to.
That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
Therefore, the server interacts with external, more reliable storage to back up and restore its state.
The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait.
There are the following implementations present:
* local filesystem — to use in tests mainly
* AWS S3 - to use in production
Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
The backup service is disabled by default and can be enabled to interact with a single remote storage.
CLI examples:
* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
For local S3 installations, refer to the their documentation for name format and credentials.
Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets.
// There should be nothing left, but let's be sure
thread_mgr::shutdown_threads(None,None,None);
info!("Shut down successfully completed");
std::process::exit(0);
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.