Compare commits

..

435 Commits

Author SHA1 Message Date
Arseny Sher
23644ed251 set pageserver id in dockerfile 2022-02-23 09:17:45 +03:00
Dmitry Rodionov
99e0f07a1d review adjustments, fancy enum for builder, minor cleanups 2022-02-23 08:33:50 +03:00
Dmitry Rodionov
5d490babf8 add node id to pageserver
This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for envrionment.
2022-02-23 08:33:50 +03:00
Arseny Sher
5865f85ae2 Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.

Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
2022-02-23 08:33:50 +03:00
Dhammika Pathirana
b815f5fb9f Add no_sync check in storage
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-02-22 12:01:12 -08:00
anastasia
74a0942a77 Fix zenith feedback processing at compute node.
Add test for backpressure
2022-02-22 13:56:21 +03:00
anastasia
1a4682a04a Add 'walreceiver-after-ingest' failpoint. Use sleep at this point to imitate slow walreceiver. 2022-02-22 13:56:21 +03:00
Heikki Linnakangas
993b544ad0 Change default parameters for back pressure
Fixes issue #1238 and #1189. Extracted from PR #1194, with some comment
editorialization by me.

Author: Konstantin Knizhnik <knizhnik@zenith.tech>
2022-02-22 13:56:21 +03:00
Arthur Petukhovsky
dba1d36a4a Refactor WAL utils in safekeeper (#1290)
wal_storage.rs was split up from timeline.rs, safekeeper.rs and send_wal.rs,
and now contains all WAL related code from the safekeeper. Now there are
PhysicalStorage for persisting WAL to disk and WalReader for reading it.
This allows optimizing PhysicalStorage without affecting too much of other
code.

Also there is a separate structure for persisting control file now in
control_file.rs.
2022-02-21 17:20:53 +03:00
Bojan Serafimov
ca81a550ef Fmt 2022-02-21 16:43:28 +03:00
Bojan Serafimov
65a0b2736b Add static router 2022-02-21 16:43:28 +03:00
Bojan Serafimov
cca886682b Undo cplane change 2022-02-21 16:43:28 +03:00
Bojan Serafimov
c8f47cd38e Fix param name 2022-02-21 16:43:28 +03:00
Bojan Serafimov
92787159f7 Add client auth method option 2022-02-21 16:43:28 +03:00
anastasia
abb422d5de Fix SafekeeperMetrics parsing in python tests 2022-02-21 13:45:22 +03:00
bojanserafimov
fdc15de8b2 Add perf test: test_random_writes (#1292) 2022-02-18 15:46:29 -05:00
Stas Kelvich
207286f2b8 Actualize branching parts of openapi spec.
Previous version of spec caused parsing errors in generated clients
as return type is object not array, also one field was missing. In
a passing set `format: hex` on ancestor_id too as value conforms to
that format.
2022-02-18 20:22:21 +02:00
Dhammika Pathirana
d2b896381a Add safekeeper tenant tags in lsn/wal metrics
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add tenant_id in lsn/wal metrics (#1234)
2022-02-18 08:26:37 -08:00
Dhammika Pathirana
009f6d4ae8 Fix safekeeper metric tags
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Use separate tags in sk storage file histo (#1234)
2022-02-18 08:26:37 -08:00
Kirill Bulatov
1b31379456 Log postgres errors with ERROR level 2022-02-17 13:42:09 +02:00
Bojan Serafimov
4c64b10aec Revert removal of ignore hint 2022-02-17 13:41:49 +02:00
Bojan Serafimov
ad262a46ad Remove redundant pytest_plugins assignment 2022-02-17 13:41:49 +02:00
Kirill Bulatov
ce533835e5 Use uuid.UUID types for tenants and timelines more 2022-02-17 13:41:19 +02:00
Kirill Bulatov
e5bf520b18 Use types in zenith cli invocations in Python tests 2022-02-17 13:41:19 +02:00
Dmitry Rodionov
9512e21b9e fix python formatting 2022-02-17 13:22:14 +03:00
Dmitry Ivanov
a26d565282 [proxy] Replace private static map with a public CancelMap
This is a cleaner approach which might facilitate testing.
2022-02-17 11:54:27 +03:00
Dmitry Ivanov
a47dade622 [proxy] Migrate to async
This change makes most parts of the code asynchronous, except
for the `mgmt` subsystem (we're going to drop it anyway).

Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
2022-02-17 11:54:27 +03:00
Dmitry Rodionov
9cce430430 remove several obsolete management api commands from pageserver's libpq
api

these commands are now available via http api
2022-02-17 11:26:28 +03:00
Dhammika Pathirana
4bf4bacf01 Add cli start/stop test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add a test for #1260
2022-02-16 13:19:12 -08:00
bojanserafimov
335abfcc28 Add slow seqscan perf test (#1283) 2022-02-16 10:59:51 -05:00
bojanserafimov
afb3342e46 Add vanilla pg baseline tests (#1275) 2022-02-15 13:44:22 -05:00
Kirill Bulatov
5563ff123f Reuse tenant-timeline id struct from utils 2022-02-15 17:45:23 +02:00
Dhammika Pathirana
0a557b2fa9 Add cli v4 loopback listener ports test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add a test for #1247
2022-02-15 17:01:22 +02:00
Heikki Linnakangas
9632c352ab Avoid having multiple records for the same page and LSN.
If a heap UPDATE record modified two pages, and both pages needed to have
their VM bits cleared, and the VM bits were located on the same VM page,
we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for
the same VM page. That produced warnings like this in the pageserver log:

    Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists

To fix, change ClearVisibilityMapFlags so that it can update the bits
for both pages as one operation.

This was already covered by several python tests, so no need to add a
new one. Fixes #1125.

Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
2022-02-15 14:26:16 +02:00
Arseny Sher
328e3b4189 bump vendor/postgres to fix compiler warnings 2022-02-15 06:51:16 +03:00
Arseny Sher
47f6a1f9a8 Add -Werror to CI builds. 2022-02-15 06:51:16 +03:00
Dmitry Rodionov
a4829712f4 merge directories in git-upload instead of removing existing files for perf test result uploads 2022-02-15 03:47:06 +03:00
Arseny Sher
d4d26f619d bump vendor/postgres to fix compilation warning 2022-02-14 21:00:11 +03:00
Arseny Sher
36481f3374 bump vendor/postgres to init pgxactoff in walproposer
ref #1244
2022-02-14 15:57:38 +03:00
Dhammika Pathirana
d951dd8977 Fix cli start (#1260)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-02-10 18:36:02 -05:00
bojanserafimov
ea13838be7 Add pgbench baseline test (#1204)
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
2022-02-10 15:33:36 -05:00
Dmitry Rodionov
b51f23cdf0 pass perf test cluster connstr to circle ci jobs 2022-02-10 17:49:54 +03:00
Kirill Bulatov
3cfcdb92ed Fix tokio features in zenith utils to enable its standalone compilation 2022-02-10 08:33:22 -05:00
Kirill Bulatov
d7af965982 Do not leak decoding_key in JwtAuth's Debug representation 2022-02-10 08:33:22 -05:00
Kirill Bulatov
7c1c7702d2 Code review fixes 2022-02-10 08:33:22 -05:00
Kirill Bulatov
6eef401602 Move routerify behind zenith_utils 2022-02-10 08:33:22 -05:00
Kirill Bulatov
c5b5905ed3 Remove parking_lot dependency from workspace 2022-02-10 08:33:22 -05:00
Kirill Bulatov
76b74349cb Bump pageserver dependencies 2022-02-10 08:33:22 -05:00
Dmitry Rodionov
b08e340f60 point perf results back from testing to master 2022-02-10 14:18:34 +03:00
Dmitry Rodionov
a25fa29bc9 modify git-upload for generate_and_push_perf_report.sh needs 2022-02-10 13:12:19 +03:00
Dmitry Rodionov
ccf3c8cc30 store performance test results in our staging cluster to be able to
visualize them in grafana
2022-02-10 13:12:19 +03:00
Heikki Linnakangas
c45ee13b4e Bump vendor/postgres, to fix memory leak.
See https://github.com/zenithdb/postgres/pull/129
2022-02-10 11:29:38 +02:00
anastasia
f1e7db9d0d Bump vendor/postgres rebased to 14.2 2022-02-10 11:19:10 +03:00
Heikki Linnakangas
fa8a6c0e94 Reduce logging of walkeeper normal operations.
It was printing a lot of stuff to the log with INFO level, for routine
things like receiving or sending messages. Reduce the noise. The amount
of logging was excessive, and it was also consuming a fair amount of CPU
(about 20% of safekeeper's CPU usage in a little test I ran).
2022-02-10 08:34:30 +02:00
Dhammika Pathirana
1e8ca497e0 Fix safekeeper loopback addr (#1247)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-02-10 09:23:53 +03:00
Heikki Linnakangas
a504cc87ab Bump vendor/postgres for "Make getpage requests interruptible"
See https://github.com/zenithdb/zenith/issues/1224
2022-02-09 16:13:46 +02:00
Heikki Linnakangas
5268bbc840 Bump vendor/postgres for fixes to cluster size limit.
See https://github.com/zenithdb/postgres/pull/126
2022-02-09 15:52:21 +02:00
Arseny Sher
e1d770939b Bump vendor/postgres to fix recent CI failure.
See zenithdb/postgres#127
2022-02-09 08:50:45 -05:00
Egor Suvorov
2866a9e82e Fix safekeeper LSN metrics (#1216)
* Always initialize flush_lsn/commit_lsn metrics on a specific timeline, no more `n/a`
* Update flush_lsn metrics missing from cba4da3f4d
* Ensure that flush_lsn found on load is >= than both commit_lsn and truncate_lsn
* Add some debug logging
2022-02-07 20:05:16 +03:00
Kirill Bulatov
b67cddb303 Implement EphemeralFile flush in a least dangerous way 2022-02-05 22:02:59 -05:00
anastasia
cb1d84d980 Make test_timeline_size_quota more deterministic 2022-02-06 02:16:36 +03:00
anastasia
642797b69e Implement cluster size quota for zenith compute node.
Use GUC zenith.max_cluster_size to set the limit.

If limit is reached, extend requests will throw out-of-space error.
When current size is too close to the limit - throw a warning.

Add new test: test_timeline_size_quota.
2022-02-06 02:16:36 +03:00
Kirill Bulatov
3ed156a5b6 Add a CLI tool to manipulate remote storage blob files 2022-02-05 15:48:08 -05:00
Heikki Linnakangas
2d93b129a0 Avoid eprintln() in pageserver and walkeeper.
Use log::error!() instead. I spotted a few of these "connection error"
lines in the logs, without timestamps and the other stuff we print for
all other log messages.
2022-02-05 17:59:31 +02:00
Arseny Sher
32c7859659 bump vendor/postgres 2022-02-05 01:27:31 +03:00
Arseny Sher
729ac38ea8 Centralize suspending/resuming timeline activity on safekeepers.
Timeline is active whenever there is at least 1 connection from compute or
pageserver is not caught up. Currently 'active' means callmemaybes are being
sent.

Fixes race: now suspend condition checking and callmemaybe unsubscribe happen
under the same lock.
2022-02-03 02:34:10 +03:00
Andrey Taranik
d69b0539ba proxy chart staging values update for labels (#1202) 2022-02-01 13:31:05 +03:00
Dmitry Ivanov
ec78babad2 Use mold instead of default linker 2022-01-28 20:40:50 +03:00
Dmitry Ivanov
9350dfb215 [CI] Merge *.profraw files prior to uploading workspace
Hopefully, this will make CI pipeline a bit faster.
2022-01-28 19:56:28 +03:00
Dmitry Ivanov
8ac8be5206 [scripts/coverage] Implement merge command
This will drastically decrease the size of CI workspace uploads.
2022-01-28 19:56:28 +03:00
Dmitry Ivanov
c2927353a5 Enable async deserialization of FeMessage
Now it's possible to call Fe{Startup,}Message in both
sync and async contexts, which is good for proxy.

Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
2022-01-28 19:40:37 +03:00
Kirill Bulatov
33251a9d8f Disable failing remote storage tests for now 2022-01-28 18:35:46 +03:00
Konstantin Knizhnik
c045ae7a9b Fix random range for keys in test_gc_aggressive.py (#1199) 2022-01-28 16:29:55 +03:00
Dmitry Rodionov
602ccb7d5f distinguish failures for pre-initdb lsn and pre-ancestor lsn branching in test_branch_behind 2022-01-28 12:31:15 +03:00
Dmitry Rodionov
5df21e1058 remove Timeline::start_lsn in favor of ancestor_lsn 2022-01-28 12:31:15 +03:00
Konstantin Knizhnik
08135910a5 Fix checkpoint.nextXid update (#1166)
* Fix checkpoint.nextXid update

* Add test for cehckpoint.nextXid

* Fix indentation of test_next_xid.py

* Fix mypy error in test_next_xid.py

* Tidy up the test case.

* Add a unit test

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
2022-01-27 18:21:51 +03:00
Konstantin Knizhnik
f58a22d07e Freeze layers at the same end LSN (#1182)
* Freeze vectors at the same end LSN

* Fix calculation of last LSN for inmem layer

* Do not advance disk_consistent_lsn is no open layer was evicted

* Fix calculation of freeze_end_lsn

* Let start_lsn be larger than oldest_pending_lsn

* Rename 'oldest_pending_lsn' and 'last_lsn', add comments.

* Fix future_layerfiles test

* Update comments conserning olest_lsn

* Update comments conserning olest_lsn

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
2022-01-27 18:21:00 +03:00
Arthur Petukhovsky
cedde559b8 Add test for replacement of the failed safekeeper (#1179)
* Add test to replace failed safekeeper

* Restart safekeepers in test_replace_safekeeper

* Update vendor/postgres
2022-01-27 17:26:55 +03:00
Arthur Petukhovsky
49d1d1ddf9 Don't call adjust_for_wal_acceptors after pg create (#1178)
Now zenith_cli handles wal_acceptors config internally, and if we
will append wal_acceptors to postgresql.conf in python tests, then
it will contain duplicate wal_acceptors config.
2022-01-27 17:23:14 +03:00
Arseny Sher
86045ac36c Prefix per-cluster directory with ztenant_id in safekeeper.
Currently ztimelineids are unique, but all APIs accept the pair, so let's keep
it everywhere for uniformity.

Carry around ZTTId containing both ZTenantId and ZTimelineId for simplicity.

(existing clusters on staging ought to be preprocessed for that)
2022-01-27 17:22:07 +03:00
Konstantin Knizhnik
79f0e44a20 Gc cutoff rwlock (#1139)
* Reproduce github issue #1047.

* Use RwLock to protect gc_cuttof_lsn

* Eeduce number of updates in test_gc_aggressive

* Change  test_prohibit_get_page_at_lsn_for_garbage_collected_pages test

* Change  test_prohibit_get_page_at_lsn_for_garbage_collected_pages

* Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC

* Remove random sleep between wait_for_lsn and get_page_at_lsn

* Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn

* Update test_prohibit_branch_creation_on_pre_initdb_lsn test

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
2022-01-27 14:41:16 +03:00
anastasia
c44695f34b bump vendor/postgres 2022-01-27 11:20:45 +03:00
anastasia
5abe2129c6 Extend replication protocol with ZentihFeedback message
to pass current_timeline_size to compute node

Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
2022-01-27 11:20:45 +03:00
Dmitry Rodionov
63dd7bce7e bandaid to avoid concurrent timeline downloading until proper refactoring/fix 2022-01-26 19:54:09 +03:00
Dmitry Rodionov
f3c73f5797 cache python deps in circle ci 2022-01-26 13:01:12 +03:00
Dmitry Rodionov
e6f2d70517 use 2021 rust edition 2022-01-25 18:48:49 +03:00
Andrey Taranik
be6d1cc360 Use zimg as builders (#1165)
* try use own builder images

* add postgres headers before build zenith

* checkout submodule before zenith build

* circleci cleanup
2022-01-25 00:58:37 +03:00
Dmitry Ivanov
703716228e Use &str instead of String in BeMessage::ErrorResponse
There's no need in allocating string literals in the heap.
2022-01-24 18:49:05 +03:00
Dmitry Rodionov
458bc0c838 walkeeper: use named type as a key in callmemaybe subscriptions hashmap 2022-01-24 17:20:15 +03:00
Dmitry Rodionov
39591ef627 reduce flakiness 2022-01-24 17:20:15 +03:00
Dmitry Rodionov
37c440c5d3 Introduce first version of tenant migraiton between pageservers
This patch includes attach/detach http endpoints in pageservers. Some
changes in callmemaybe handling inside safekeeper and an integrational
test to check migration with and without load. There are still some
rough edges that will be addressed in follow up patches
2022-01-24 17:20:15 +03:00
anastasia
81e94d1897 Add LSN and Backpressure descriptions to glossary.md 2022-01-24 12:52:30 +03:00
Konstantin Knizhnik
7bc1274a03 Fix comparison with disk_consistent_lsn in newer_image_layer_exists (#1167) 2022-01-24 12:19:18 +03:00
Dmitry Rodionov
5f5a11525c Switch our python package management solution to poetry.
Mainly because it has better support for installing the packages from
different python versions.

It also has better dependency resolver than Pipenv. And supports modern
standard for python dependency management. This includes usage of
pyproject.toml for project specific configuration instead of per
tool conf files. See following links for details:
 https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
 https://www.python.org/dev/peps/pep-0518/
2022-01-24 11:33:47 +03:00
Konstantin Knizhnik
e209764877 Do not delete layers beyand cutoff LSN (#1128)
* Do not delete layers beyand cutoff LSN

* Update pageserver/src/layered_repository/layer_map.rs

Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>

Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
2022-01-24 10:42:40 +03:00
Kirill Bulatov
65290b2e96 Ensure every submodule compiles on its own 2022-01-21 17:34:15 +03:00
Dmitry Ivanov
127df96635 [proxy] Make NUM_BYTES_PROXIED_COUNTER more precise 2022-01-21 17:31:19 +03:00
Kirill Bulatov
924d8d489a Allow enabling S3 mock in all existing tests with an env var 2022-01-20 18:42:47 +02:00
Dmitry Rodionov
026eb64a83 Use python lib to mock s3 2022-01-20 18:42:47 +02:00
Kirill Bulatov
45124856b1 Better S3 remote storage logging 2022-01-20 18:42:47 +02:00
Kirill Bulatov
38c6f6ce16 Allow specifying custom endpoint in s3 2022-01-20 18:42:47 +02:00
Heikki Linnakangas
caa62eff2a Fix description of proxy --auth-endpoint option. 2022-01-20 14:50:27 +03:00
Dmitry Ivanov
d3542c34f1 Refactoring: use anyhow::Context's methods where possible 2022-01-19 16:33:48 +03:00
Kirill Bulatov
7fb62fc849 Fix macos compilation 2022-01-18 23:01:04 +02:00
Andrey Taranik
9d6ae06663 monitoring turn on for proxy (#1146) 2022-01-18 19:23:53 +03:00
Alexey Kondratov
06c28174c2 Integrate compute_tools into zenith workspace and improve logging (zenithdb/console#487) 2022-01-18 18:47:31 +03:00
bojanserafimov
8af1b43074 proxy: Add new metrics (#1132) 2022-01-14 19:12:43 -05:00
Heikki Linnakangas
17b7caddcb Update vendor/postgres: silence excessive logging from walproposer. 2022-01-14 20:51:02 +02:00
Heikki Linnakangas
dab30c27b6 Refactor thread management and shutdown
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.

Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036, previously the libpq
listener thread would not exit until one more connection arrives.

This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHanlde of each new thread to a vector but old handles
for threads that had already exited were never removed.
2022-01-14 18:36:10 +02:00
Heikki Linnakangas
bad1dd9759 Don't panic if spawning a new WAL receiver thread fails.
The panic would kill the page service thread. That's not too bad, but
still let's try to handle it more gracefully.
2022-01-14 18:02:34 +02:00
Heikki Linnakangas
d29836d0d5 Don't panic if spawning a thread to handle a connection fails.
Log the error and continue. Hopefully it's a transient failure.

This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
2022-01-14 18:02:30 +02:00
Heikki Linnakangas
adb0b3dada Include backtrace in error messages in the log.
'anyhow' crate can include a backtrace in all errors, when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
2022-01-14 10:10:17 +02:00
bojanserafimov
5e0f39cc9e Add proxy metrics (#1093) 2022-01-13 20:34:30 -05:00
Arthur Petukhovsky
0a34a592d5 Bump vendor/postgres (#1120) 2022-01-13 20:28:37 +03:00
Heikki Linnakangas
19aaa91f6d Timeline IDs are not globally unique, fix some code that assumed that.
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.

The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.

In the passing, reorder some function arguments to be more consistent.
2022-01-13 18:45:30 +02:00
Konstantin Knizhnik
404aab9373 Use mutex to prevent concurrent checkpoints (#1115)
* Use mutex to prevent concurrent checkpoints

* Fix comment
2022-01-13 17:48:24 +03:00
Konstantin Knizhnik
bc6db2c10e Implement IO metrics in VirtualFile (#1112)
* Implement IO metrics in VirtualFile

* Do not group virtual file close statistics by tenantid/timelineid

* Add comments concenring close metrics
2022-01-13 17:36:53 +03:00
Heikki Linnakangas
772d853dcf Fix race condition leading to panic in walkeeper.
The walkeeper launch two threads for each connection, and uses a guard
object to remove entry from 'replicas' array, when finishes. But only
the background thread held onto the guard object, so if the background
thread finished before the other thread, the array entry would be
removed prematurely, which lead to panic in the check_stop_streaming()
call.

Fixes https://github.com/zenithdb/zenith/issues/1103
2022-01-13 11:21:11 +02:00
Arseny Sher
ab4d272149 Add safekeeper --dump-control-file option.
Hexalize zids there for better output; since Serde doesn't support several
formats for one struct, on-disk representation is changed as well, make
upgrade.rs cope with it.
2022-01-12 19:47:24 +03:00
Konstantin Knizhnik
f70a5cad61 Fix releasing of timelines lock (#1100)
refer #1087
2022-01-12 15:05:08 +03:00
anastasia
7aba299dbd Use safekeeper in test_branch_behind (#1068)
to avoid a subtle race condition.

Without safekeeper, walreceiver reconnection can stuck,
because of IO deadlock between walsender auth and regular backend.
2022-01-12 14:38:04 +03:00
Kirill Bulatov
4b3b19f444 Support prefixes when working with s3 buckets 2022-01-11 15:44:50 +02:00
Kirill Bulatov
8ab4c8a050 Code review fixes 2022-01-11 15:44:23 +02:00
Kirill Bulatov
7c4a653230 Propagate Zenith CLI's RUST_LOG env var to subprocesses 2022-01-11 15:44:23 +02:00
Kirill Bulatov
a3cd8f0e6d Add the remote storage test 2022-01-11 15:44:23 +02:00
Kirill Bulatov
65c851a451 Test pageserver's timeline http methods
z
2022-01-11 15:44:23 +02:00
Kirill Bulatov
23cf2fa984 Properly shutdown storage sync loop 2022-01-11 15:44:23 +02:00
Kirill Bulatov
ce8d6ae958 Allow using remote storage in tests 2022-01-11 15:44:23 +02:00
Kirill Bulatov
384b2a91fa Pass generic pageserver params through zenith cli 2022-01-11 15:44:23 +02:00
Arseny Sher
233c4811db Fix default safekeeper http port. 2022-01-11 10:13:27 +03:00
Konstantin Knizhnik
2fd4c390cb Do not hold timelines lock during GC (#1089)
* Do not hold timelines lock during GC
refer #1087

* Add gc_cs mutex for preveting creation of new timelines during GC

* Make clippy happy

* Use Mutex<()> instead of Mutex<i32> for GC critical section
2022-01-10 14:41:15 +03:00
bojanserafimov
5b9391b51d Support "query cancel" in proxy (#1052) 2022-01-05 17:27:12 -05:00
Arthur Petukhovsky
5a6405848d Bump vendor/postgres (#1086) 2022-01-05 14:27:51 +03:00
Patrick Insinger
191d9d2b74 par_fsync - use VirtualFile 2022-01-04 20:40:57 -08:00
Patrick Insinger
24c8dab86f pageserver - parallelize checkpoint fsyncs 2022-01-04 20:40:57 -08:00
Heikki Linnakangas
55a4cf64a1 Refactor WAL record handling.
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.

Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.

Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
2022-01-04 11:26:37 +02:00
Heikki Linnakangas
722667f189 Add test case for performance issue #941.
The first COPY generates about 230 MB of write I/O, but the second
COPY, after deleting most of the rows and vacuuming the rows away,
generates 370 MB of writes. Both COPYs insert the same amount of data,
so they should generate roughly the same amount of I/O. This commit
doesn't try to fix the issue, just adds a test case to demonstrate it.

Add a new 'checkpoint' command to the pageserver API. Previously,
we've used 'do_gc' for that, but many tests, including this new one,
really only want to perform a checkpoint and don't care about GC. For
now, I only used the command in the new test, though, and didn't
convert any existing tests to use it.
2022-01-04 11:26:37 +02:00
Arseny Sher
25a515b968 Don't call immediately on resume in callmemaybe.
It creates busy loop if pageserver <-> safekeeper connection fails after it was
established (e.g. currently due to 'segment checkpoint not found' error on
pageserver).

Also wake up callmemaybe thread regularly once in recall_period regardless of
channel activity.
2022-01-03 20:44:36 +03:00
Konstantin Knizhnik
1c47fbae81 Do not write image layers during enforced checkpoint (#1057)
* Do not write image layers during enforced checkpoint
refer #1056

* Add Flush option to CheckpointConfig

refer #1057
2022-01-01 19:08:09 +03:00
Alexey Kondratov
8f0cd7fb9f [compute_tools] Switch cluster_id in spec to string (zenithdb/console#72) 2021-12-29 16:35:29 +03:00
Dmitry Rodionov
c910132d4b Fix wal receiver shutdown
This patch allows to shutdown wal receiver when there are no messages
and wal receiver is blocked inside tokio-postgres. In this case it
cannot check the shutdown flag.

This patch switches to use async interface of tokio-postgres directly
without sync wrappers. It opens the possibility to use tokio::select!
between the phsycal_stream.next() and a shutdown channel readiness to
interrupt replication process.

Also this allows to shutdown only particular wal receiver without
using global shutdown_requested flag.
2021-12-29 14:42:29 +03:00
Arthur Petukhovsky
70778058d9 Add test for safekeeper setup without pageserver (#1000) 2021-12-29 12:58:27 +03:00
nikitashamgunov
a379b45257 Update README.md 2021-12-28 14:26:42 -08:00
bojanserafimov
24eca8d58b Parse cancel message in pq_proto (#1060) 2021-12-28 16:43:44 -05:00
Bojan Serafimov
1e3ddd43bc Add struct for key data 2021-12-28 22:40:22 +03:00
Bojan Serafimov
989371493b Add BeMessage::BackendKeyData variant 2021-12-28 22:40:22 +03:00
Alexey Kondratov
f64074c609 Move compute_tools from console repo (zenithdb/console#383)
Currently it's included with minimal changes and lives aside of the main
workspace. Later we may re-use and combine common parts with zenith
control_plane.

This change is mostly needed to unify cloud deployment pipeline:
1.1. build compute-tools image
1.2. build compute-node image based on the freshly built compute-tools
2. build zenith image

So we can roll new compute image and new storage required by it to
operate properly. Also it becomes easier to test console against some
specific version of compute-node/-tools.
2021-12-28 20:17:29 +03:00
anastasia
eba897ffe7 send CallmeEvent::Unsubscribe request only when pageserver is caught up with safekeeper and it's time to stop streaming 2021-12-28 17:50:48 +03:00
anastasia
5ef2b1baf7 Add new test illustrating issue with sync-safekeepers.
If safekeepers sync fast enough, callmemaybe thread may never make a call before receiving Unsubscribe request. This leads to the situation, when pageserver lacks data that exists on safekeepers.
2021-12-28 17:50:48 +03:00
Kirill Bulatov
f0afd08667 Fix zenith init defaults 2021-12-28 00:21:48 +02:00
Kirill Bulatov
b494ac1ea0 Remove redundant pageserver cli params 2021-12-27 18:38:54 +02:00
Arseny Sher
a163650a99 Refactor Postgres command parsing in safekeeper.
Do it separately with SafekeeperPostgresCommand enum as a result. Since query is
always C string, switch postgres_backend process_query argument from Bytes to
&str.

Make passing ztli/ztenant id in safekeeper connection string optional; this is
needed for upcoming intra-safekeeper heartbeat cmd which is not bound to any
timeline.
2021-12-24 15:48:13 +03:00
anastasia
980f5f8440 Propagate remote_consistent_lsn to safekeepers.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration respectively.

Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
2021-12-24 15:32:54 +03:00
Kirill Bulatov
42647f606e Use correct pageserver CLI parameters in docker entrypoint 2021-12-24 03:41:45 +02:00
bojanserafimov
b807570f46 Use parking_lot::Mutex instead of std::Mutex in walreceiver (#1045) 2021-12-23 14:25:44 -05:00
Kirill Bulatov
114a757d1c Use generic config parameters in pageserver cli
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
2021-12-23 18:58:28 +02:00
Andrey Taranik
9854ded56b Feature/proxy deploy (#1046)
* zenith proxy deployment

* proxy deploy ci fix

* ci cleanup or zenith proxy deploy
2021-12-23 15:53:28 +03:00
Heikki Linnakangas
fdd987c3ad Refactor the way Image- and DeltaLayers are created
Introduce builder objects, DeltaLayerWriter and ImageLayerWriter.
This gives more flexibility, as the DeltaLayer::create and
ImageLayer::create functions don't need to know about the details of
the format of where the page versions are coming from. This allows us
to change the format used in InMemoryLayer more easily, without having
to modify Delta- and ImageLayer code.

Also refactor the code in InMemoryLayer::write_to_disk for clarity.
2021-12-23 00:33:16 +02:00
Heikki Linnakangas
da62407fce Change the meaning of 'blknum' argument in Layer trait
Previously, the 'blknum' argument of various Layer functions was the
block number within the overall relation. That was pretty confusing,
because an individual layer only holds data from a one segment of the
relation. Furthermore, the 'put_truncation' function already dealt
with per-segment size, not overall relation size, adding to the
confusion.

Change the meaning of the 'blknum' argument to mean the block number
within the segment, not the overall relation.
2021-12-22 16:55:37 +02:00
Heikki Linnakangas
1cc181ca32 Fix WAL redo of commit records with subtransactions.
If a commit record contains XIDs that are stored on different CLOG pages,
we duplicate the commit record for each affected CLOG page. In the redo
routine, we must only apply the parts of the record that apply to the
CLOG page being restored. We got that right in the loop that handles the
sub-XIDs, but incorrectly always set the bit that corresponds to the main
XID.
2021-12-21 23:08:01 +02:00
Heikki Linnakangas
927587cec8 Fix comments in tests 2021-12-21 22:38:33 +02:00
Heikki Linnakangas
bcf80eaa95 Fix multixacts members WAL redo.
The logic to compute the page number was broken, and as a result, only
the first page of multixact members was updated correctly. All the
rest were left as zeros. Improve test_multixact.py to generate more
multixacts, to cover this case.

Also fix the check that the restored PG data directory matches the
original one. Previously, the test compared the 'pg_new' cluster,
which is a bit silly because the test restored the 'pg_new' cluster
only a few lines earlier, so if the multixact WAL redo is somehow
broken, the comparison will just compare two broken data directories
and report success. Change it to compare the original datadir, the one
where the multixacts were originally created, with a restored image of
the same.
2021-12-21 17:50:06 +02:00
Arthur Petukhovsky
f56db3da68 Bump vendor/postgres (#996) 2021-12-21 16:53:08 +03:00
Konstantin Knizhnik
68aa9d2715 Set utf8 encoding in initdb (#993)
refer #992
2021-12-21 15:43:34 +03:00
Konstantin Knizhnik
76777f5812 Add utility for dumping/editing metadata file (#1031) 2021-12-21 15:43:15 +03:00
Arseny Sher
56312522f9 Make safekeeper namings more consistent with reality.
s/send_wal.rs/handler.rs
s/SendWalHandler/SafekeeperPostgresHandler
s/replication.rs/send_wal.rs
2021-12-21 13:24:23 +03:00
Dmitry Rodionov
2d9d0658e8 adjust benchmarking script for go console 2021-12-20 13:54:10 +03:00
anastasia
3b61f364f7 Stop WAL streaming threads, when compute node is shut down.
WAL stream uses the 2 connections:
1. Compute node (walproposer) -> Safekeeper (ReceiveWalConn module)

When compute node is shut down, safekeeper needs to stop the respective receiving thread.
Prior to this PR it didn't work because PostgresBackend haven't handled disconnection properly.

2. Safekeeper (ReplicationConn module) -> pageserver (walreceiver thread)

When incoming WAL stream is gone, safekeeper can stop streaming WAL and cancel connection as soon as replica is caught up.
Note that the WAL can be streamed to multiple replicas simultaneously, only disconnect ones that are caught up to the last_recieved_lsn.
2021-12-20 12:34:28 +03:00
anastasia
90e5b6f983 Don't try to reconnect failed walreceiver. If necessary, wal service will send new callmemaybe request 2021-12-20 12:34:28 +03:00
Heikki Linnakangas
75cbaafb96 Remove old ephemeral files on pageserver restart.
The ephemeral files are not usable after restart, so just delete them.
Before this, you got "unrecognized filename in timeline dir" warnings
about them, as Konstantin noted at:
https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870.

While we're at it, refactor away the list_files() function, moving the
logic fully into the caller. Seems more straightforward.
2021-12-17 00:00:02 +02:00
Andrey Taranik
5d5c2738a6 staging deployment flow fix (#1029) 2021-12-16 22:54:01 +03:00
Andrey Taranik
cbe155ff48 storage CI flow for staging environment (#1003)
* storage CI flow for staging environment

* prevent deploy version older than already deployed
2021-12-16 17:05:20 +03:00
Kirill Bulatov
29143b018e Disable rustc incremental compilation to avoid ICEs 2021-12-15 21:57:34 +03:00
Heikki Linnakangas
d8a367dd32 Remove dead code, fix typos. 2021-12-15 19:58:03 +02:00
Kirill Bulatov
ca60561a01 Propagate disk consistent lsn in timeline sync statuses 2021-12-15 15:13:21 +02:00
Andrey Taranik
86a409a174 cleanup circleci config after test 2021-12-15 16:08:31 +03:00
Andrey Taranik
66242f0d0e tag docker image by commit sha and add docker build for compute 2021-12-15 16:08:31 +03:00
Heikki Linnakangas
7f78e80c51 Refactor WAL ingestion code.
Rename save_decoded_record() to ingest_record(), and move the
responsibility for decoding the record into ingest_record().

Also move the responsibility of updating the CheckPoint relish to
ingest_record(). Put it in a new WalIngest struct, to help with tracking
that.
2021-12-14 20:24:03 +02:00
Heikki Linnakangas
f8f88154d5 Split restore_local_repo.rs into two files, with more descriptive names. 2021-12-14 20:24:03 +02:00
Kirill Bulatov
5cff7d1de9 Use proper download order 2021-12-14 15:32:22 +02:00
Arseny Sher
8f0cafd508 Grab safekeeper.lock on the whole directory instead of per tli.
closes #976
2021-12-13 22:11:04 +03:00
Heikki Linnakangas
e0d41ac6a3 Move constants related to metadata file to metadata.rs.
They're not used anywhere else, so seems like a better place.
2021-12-13 16:57:16 +02:00
Heikki Linnakangas
72ef59c378 Fix small typos in comments, add a comment.
The introducing paragraph README could use some more love, but let's at
least fix the typos.
2021-12-13 13:44:08 +02:00
Kirill Bulatov
673c297949 Download timelines on demand 2021-12-10 17:23:35 +02:00
Kirill Bulatov
e61732ca7c Compress checkpoint files before streaming into S3 2021-12-10 17:23:35 +02:00
Heikki Linnakangas
cb4a8396fb Use rustls rather than native-tls in all dependencies.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
2021-12-10 15:14:27 +02:00
Heikki Linnakangas
c77e30116e Split waldecoder.rs into two source files.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.

This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.

(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
2021-12-10 15:14:13 +02:00
Heikki Linnakangas
9d369f158c Update rust-s3 to version 0.28.0
0.28.0 includes two changes I submitted to upstream:

- Add support for older ListObjects API, needed to use rust-s3 with Google
  Cloud Storage: https://github.com/durch/rust-s3/pull/229

- If file is smaller than one chunk, don't initiate multi-part upload.
  https://github.com/durch/rust-s3/pull/228

These are not critical for Zenith right now, but let's stay up-to-date.
2021-12-10 14:52:08 +02:00
Heikki Linnakangas
6ecd442fb9 Remove a bunch of unnecessary dependencies. 2021-12-10 14:24:33 +02:00
Heikki Linnakangas
f3f059c1f8 Fix a few cases where request beyond end of rel would error out.
Currently, we return an all-zeros page if you request a block beyond end of
a relation. That has been implemented in LayeredTimeline::materialize_page,
so that if Layer::get_page_reconstruct_data returns Missing, it returns
and all-zeros page.

However InMemoryLayer and DeltaLayer would return Continue, not Missing,
in that case, and materialize_page would try to find the predecessor
layer. If there was a preceding image layer, then everything would still
work, but if there wasn't, it would return a "could not find predecessor
of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them
check the size of the relation and return Missing in that case.

This is hard to reproduce at the moment, but it happened quickly with
pgbench when I modified InMemoryLayer::write_to_disk so that it didn't
always create a new ImageLayer.
2021-12-09 17:46:48 +02:00
Dmitry Ivanov
8388e14bbd [scripts/git-upload] Fix logic of --forbid-overwrite 2021-12-09 14:06:17 +03:00
anastasia
5293e183c5 callmemaybe. review code cleanup 2021-12-09 13:31:49 +03:00
anastasia
93ff5f7ff0 Add default value for safekeeper --recall option. DEFAULT_RECALL_PERIOD is 1 second. 2021-12-09 13:31:49 +03:00
anastasia
41dce68bdd callmemaybe refactoring
- Don't spawn a separate thread for each connection.
Instead use one thread per safekeeper, that iterates over all connections and sends callback requests for them.

-Use tokio postgres to connect to the pageserver, to avoid spawning a new thread for each connection.

callmemaybe review fixes:
- Spawn all request_callback tasks separately.
- Remember 'last_call_time' and only send request_callback if 'recall_period' has passed.
- If task hasn't finished till next recall, abort it and try again.
- Add pause/resume CallmeEvents to avoid spamming pageserver when connection already established.
2021-12-09 13:31:49 +03:00
Dmitry Rodionov
7dece8e4a0 skip temporary table files when comparing directories in regress tests 2021-12-09 12:53:26 +03:00
Arseny Sher
37c85d5fd9 Switch safekeeper from log to tracing logging.
Add context to wal acceptor and wal sender threads showing timeline id and
unique id differentiating them.
2021-12-09 06:57:46 +03:00
nikitashamgunov
6094236171 Update README.md 2021-12-08 11:55:54 -08:00
anastasia
bb5aba42eb bump vendor/postgres to use correct backpressure commit 2021-12-08 18:57:18 +03:00
Arthur Petukhovsky
450fb9eafe Don't persist control file without sync (#966) 2021-12-07 15:02:44 +03:00
Dmitry Rodionov
557e3024cd Forward pageserver connection string from compute to safekeeper
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
2021-12-06 21:28:49 +03:00
Arseny Sher
bd34d7ecfc Bump safekeeper control file version and allow reading the previous one.
Should have been a part of cba4da3f4d to provide upgrade for previously
existing clusters. Separates version independent header (magic + version) out of
SafeKeeperState to choose what to deserialize.
2021-12-06 19:47:55 +03:00
Dmitry Ivanov
0a8c672630 [CI] Fix benchmarks
Too bad we don't have a --dry-run in PRs :(
2021-12-06 13:52:28 +03:00
Dmitry Ivanov
b87ab17d05 Bump rust version to 1.56.1
Apparently, code coverage doesn't work that well in 1.55.
2021-12-06 13:27:52 +03:00
Dmitry Ivanov
d874675955 Collect coverage in CI 2021-12-06 13:27:52 +03:00
Dmitry Ivanov
5d37560308 Add bespoke glue script leveraging LLVM coverage tools 2021-12-06 13:27:52 +03:00
Dmitry Ivanov
7cec13d1df Improve shutdown story for code coverage
This patch introduces fixes for several problems affecting
LLVM-based code coverage:

* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.

* Implement proper shutdown handlers in safekeeper.
2021-12-06 13:27:52 +03:00
anastasia
b7685eb6ba Enable backpressure 2021-12-06 12:49:42 +03:00
anastasia
c7f3b4e62c Clarify the meaning of StandbyReply LSNs:
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
2021-12-06 12:49:42 +03:00
Heikki Linnakangas
5bad2deff8 Don't hold 'timelines' lock over checkpoint.
It was very noticeable that you while the checkpointer was busy, you
could not e.g. open a new connection.
2021-12-03 07:42:10 -05:00
Arseny Sher
d39608c367 Fix passing start_offset to find_end_of_wal_segment. 2021-12-03 12:43:57 +03:00
Arseny Sher
cba4da3f4d Add term history to safekeepers.
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).

Full history of the proposer is transferred in separate message before proposer
starts streaming; it is immediately persisted by safekeeper, though he might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.

We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.

Per https://github.com/zenithdb/rfcs/pull/12

Closes #296.

Bumps vendor/postgres.
2021-12-03 12:43:57 +03:00
Dmitry Rodionov
2669d140f8 use full commit sha for version info
for builds in docker this is not needed, since environment variable
with commit sha already contains full version
2021-12-01 17:35:57 +03:00
Heikki Linnakangas
f49ad33f1b Initialize 'loaded' correctly in DeltaLayer.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it it. Alternatively, perhaps we should close it
on unload() to save some memory, but this I'm not going to think too
hard about it right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.

This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
2021-11-30 22:23:59 +02:00
Kirill Bulatov
670205e17a Evict excessively failing sync tasks, improve processing for the rest of
the tasks
2021-11-30 13:58:49 +02:00
Konstantin Knizhnik
f72d4814b1 Extract page images from FPI WAL records (#949)
* Extract page images from FPI WAL records

* Fix issues reported in review
2021-11-30 12:57:26 +03:00
Heikki Linnakangas
5ecf0664cc Fix off-by-one error in check for future delta layers.
This doesnt show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
2021-11-29 14:35:24 +02:00
Heikki Linnakangas
7cae265447 Fix dump_layerfile.
The VirtualFile machinery panics if it's not initialized
2021-11-29 11:26:54 +02:00
Heikki Linnakangas
5aa969a588 Replace in-memory layers and OOM-triggered eviction with temp files.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.

To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.

This includes changes from 'inmemory-layer-chunks' branch to serialize /
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
2021-11-26 17:25:17 +03:00
Arthur Petukhovsky
93cc40584d Shutdown socket on CopyFail (#938)
Fixes #935
2021-11-26 16:48:27 +03:00
Dmitry Rodionov
130184fee9 Prohibit branch creation and basebackup at out of scope lsns
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.

To get there there was also two cleanups:
* Fix error handling in Execute message handler. This fixes behaviour
  when basebackup retured an error. Previously pageserver thread just
  died.
* Remove "ancestor" file which previously contained ancestor id and
  branch lsn. Currently the same data can be obtained from metadata file.
  And just the way we handled ancestor file in the code introduced the
  case when branching fails timeline directory is created but there is no data in it
  except ancestor file. And this confused gc because it scans
  directories. So it is better to just remove ancestor file and clean up
  this timeline directory creation so it happens after all validity
  checks have passed
2021-11-25 15:27:16 +03:00
Heikki Linnakangas
d47f610606 Fix pageserver CLI parameter names and document them 2021-11-25 13:31:52 +02:00
Dmitry Rodionov
0650e51f0b add test one more case for layer visibility 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
737a557f09 add check to python tests that afteer gc number of rows is unchanged in all branches 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
6f7ebe6e01 preserve data in parent branch that might be referenced in child branch 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
70ab0d5b1f add missing script 2021-11-19 00:10:40 +03:00
Dmitry Rodionov
6ac76248cf Save performance test results from perfirmance test suit runs.
Also render reports for both staging and local runs.
2021-11-19 00:00:19 +03:00
Kirill Bulatov
b32da3b42e Use less pageserver-specific method in RemoteStorage trait 2021-11-18 22:53:40 +02:00
Dmitry Ivanov
0ccfc62e88 [proxy] Pass PostgreSQL version to client
Fixes #779
2021-11-17 16:28:44 +03:00
Dmitry Ivanov
b55cf773a8 [proxy] Streamline control- and dataflow 2021-11-17 16:28:44 +03:00
Dmitry Ivanov
43ded1c54b [proxy] Minor cleanup 2021-11-17 16:28:44 +03:00
Heikki Linnakangas
f8702d4625 Fix checking for whether segment exists on a frozen in-memory layer.
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.

This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
2021-11-15 20:19:15 +02:00
Dmitry Rodionov
44111e3ba3 Prohibit branch creation at lsn that was already garbage collected.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
Also this adds a check to get_page_at_lsn which asserts that lsn at
which the page is requested was not garbage collected. This check
currently is triggered for readonly nodes which are pinned to specific
lsn and because they are not tracked in pageserver garbage collection
can remove data that still might be referenced. This is a bug and will
be fixed separately.
2021-11-15 20:03:16 +03:00
Patrick Insinger
298bc588f9 pageserver - don't try to GC InMemoryLayers 2021-11-15 09:01:45 -08:00
Heikki Linnakangas
4ba521f53f Add performance test case for parallel COPY TO 2021-11-15 14:49:53 +02:00
Heikki Linnakangas
431d32756b Add a buffer cache, and use it to store materialized pages.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.

One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.

The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.

Co-authored-by: Patrick Insinger <patrick@zenith.tech>
2021-11-12 11:02:12 -08:00
Heikki Linnakangas
3d172d98a3 Improve layered repo README.
Add an informal overview of how it works.
2021-11-12 19:59:31 +02:00
Heikki Linnakangas
849ac791a6 Bandaid fix for "page not found" errors, when a table is loaded.
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.

It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where cross segment boundary.

Fixes https://github.com/zenithdb/zenith/issues/841
2021-11-12 18:47:39 +02:00
Alexey Kondratov
de5e6a15ae Set LD_LIBRARY_PATH in the check_restored_datadir_content() psql call
Otherwise we may use outdated system libpq.
Also print stdout/stderr if basebackup failed in check_restored_datadir_content()
2021-11-12 16:27:43 +03:00
Alexey Kondratov
0d6bf14ecb Use vendor/postgres rebased on top of REL_14_1 2021-11-12 16:27:43 +03:00
Heikki Linnakangas
d1e79c4af3 Fix locking issues in VirtualFile machinery.
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:

1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.

2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.

Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
2021-11-11 20:12:59 +02:00
Kirill Bulatov
abb2ac5246 Better context when erroring 2021-11-11 19:22:05 +02:00
Kirill Bulatov
99dbbe5f18 Allow downloading remote files partially 2021-11-11 18:51:34 +02:00
Arseny Sher
e7ca8ef5a8 Use PG timelineid 1 everywhere.
As changing it doesn't have useful meaning in Zenith.

ref #824
2021-11-11 13:53:39 +03:00
Patrick Insinger
1ce4976e36 pageserver - track size of VecMaps 2021-11-10 11:09:34 -08:00
Heikki Linnakangas
9300107cdf Cache Book objects, use virtual files to avoid running out of fds.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
2021-11-10 17:19:37 +02:00
Arthur Petukhovsky
9aaa02bc9a Fix high CPU usage in walproposer (#860)
* Bump vendor/postgres

* Update time limits for test_restarts_under_load
2021-11-10 17:18:07 +03:00
Arseny Sher
5603259c53 In wal_proposer_recovery, don't wait outcoming WAL to be committed.
Otherwise we're deadlocking ourselves. Oversight of 33007cc.
2021-11-10 01:38:25 +03:00
Arseny Sher
ce15c62f35 Fix 'send WAL up to' debug logging. 2021-11-10 01:38:25 +03:00
Egor Suvorov
eaff0cd568 Check python for the whole repository and improve docs (#813) 2021-11-09 22:23:29 +03:00
Egor Suvorov
587935ebed Add Safekeeper metrics tests (#746)
* zenith_fixtures.py: add SafekeeperHttpClient.get_metrics()
* Ensure that `collect_lsn` and `flush_lsn`'s reported values look reasonable in `test_many_timelines`
2021-11-09 22:18:59 +03:00
Dmitry Rodionov
07dddfed28 Use more robust way to persist safekeeper control file.
Now safekeeper control file updated in a following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).

Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).

Also because of the rename machinery switch to use dedicated lock file
to prevent running several safekeepers concurrently on the same data

cleanup

fsync control file after rename to match postgres behaviour
2021-11-09 17:51:46 +03:00
Arseny Sher
229dc7704f Bump vendor/postgres. 2021-11-08 17:32:13 +03:00
Dmitry Rodionov
067f2ac814 fix perf repo branch name 2021-11-08 13:27:23 +03:00
Dmitry Rodionov
865870a8e5 Follow up staging benchmarking
* change zenith-perf-data checkout ref to be main
* set cluster id through secrets so there is no code changes required
  when we wipe out clusters on staging
* display full pgbench output on error
2021-11-05 14:07:11 +03:00
Arthur Petukhovsky
d19263aec8 Adjust timeouts for test_restarts_under_load (#830)
* Adjust timeouts for test_restarts_under_load

* Add test timeout for test_restarts_under_load
2021-11-04 19:58:40 +03:00
Heikki Linnakangas
6d742719a1 Fix infinite loop in looking up predecessor layer
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.

Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
2021-11-04 16:07:12 +02:00
Dmitry Rodionov
c75bc9b8b0 Change benchmark plugin layout so pytest loads it properly when running
all tests (not necessary performance ones)

resolves #837
2021-11-04 16:33:31 +03:00
Egor Suvorov
33007cc0bb Safekeeper's START_REPLICATION handler: remove stop_point, do not handle start_point == 0 (#777) 2021-11-04 14:50:33 +03:00
Dmitry Rodionov
987833e0b9 Propagate git SHA to zenith binaries
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
2021-11-04 14:22:29 +03:00
Kirill Bulatov
f36acf00de Reduce "relish" word usages in remote storage 2021-11-04 12:53:42 +02:00
Kirill Bulatov
956fc3dec9 Tidy up and make consistent the remote storate API 2021-11-04 12:53:42 +02:00
Heikki Linnakangas
b38e841f2d Use poll() in communication with WAL redo process.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
2021-11-04 10:39:04 +02:00
Heikki Linnakangas
3a0111c75e Refactor functions for constructing WAL redo messages.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
2021-11-04 10:39:00 +02:00
Heikki Linnakangas
086a02ab92 Add performance test for simple seq scans.
Fixes https://github.com/zenithdb/zenith/issues/831
2021-11-04 10:36:45 +02:00
Heikki Linnakangas
7ed39655dc Bump vendor/postgres 2021-11-04 10:35:50 +02:00
Dmitry Rodionov
c6172dae47 implement performance tests against our staging environment
tests are based on self-hosted runner which is physically close
to our staging deployment in aws, currently tests consist of
various configurations of pgbenchi runs.

Also these changes rework benchmark fixture by removing globals and
allowing to collect reports with desired metrics and dump them to json
for further analysis. This is also applicable to usual performance tests
which use local zenith binaries.
2021-11-04 02:15:46 +03:00
Heikki Linnakangas
4ba783d0af Remove a couple of unused functions.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
2021-11-03 19:10:23 +02:00
Patrick Insinger
0457fe81a9 pageserver - make PageVersion an enum 2021-11-03 09:28:49 -07:00
Heikki Linnakangas
fb524dd973 Put a global limit on memory used by in-memory layers.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.

After storing a WAL record in the repository, the WAL receiver checks
if the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.

This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.

Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
2021-11-02 15:49:39 +02:00
Heikki Linnakangas
8c6d2664c0 Support removing arbitrary open layers, not just the oldest one 2021-11-02 15:43:16 +02:00
Patrick Insinger
cdbbd15eb9 pageserver - add InMemoryLayer global map (#817) 2021-11-01 12:20:24 -07:00
anastasia
85f8bf97f5 Name walkeeper threads to make debugging more convenient 2021-11-01 19:09:57 +03:00
anastasia
83ed930bc2 WIP. Launch and shutdown tenant threads together with walreceiver.
TODO: now walreceiver only disconnects if safekeeper was shut down. Implemnt proper walreceiver disconnection.
2021-11-01 18:04:00 +03:00
anastasia
071e30cc53 Expose TENANT_THREADS_COUNT metric to observe number of currently active checkpointer and GC threads 2021-11-01 18:04:00 +03:00
Kirill Bulatov
e6ef27637b Better API to handle timeline metadata properly 2021-10-29 23:51:40 +03:00
Patrick Insinger
b532470792 Set SO_REUSEADDR for all TCP listeners 2021-10-29 12:45:26 -07:00
Heikki Linnakangas
e0d7ecf91c Refactor 'zenith' CLI subcommand handling
Also fixes 'zenith safekeeper restart -m immediate'. The stop-mode was
previously ignored.
2021-10-29 19:01:01 +03:00
Kirill Bulatov
edba2e9744 Use a proper extension for the readme file 2021-10-28 18:55:14 +03:00
Egor Suvorov
7e552b645f Add disk write/sync metrics to Safekeeper (#745) 2021-10-28 18:38:36 +03:00
anastasia
ea5900f155 Refactoring of checkpointer and GC.
Move them to a separate tenant_threads module to detangle thread management from LayeredRepository implementation.
2021-10-27 20:50:26 +03:00
anastasia
28ab40c8b7 fix init_repo() call in register_relish_download() 2021-10-27 20:50:26 +03:00
Alexey Kondratov
d423142623 Proxy: wait for kick on .pgpass connection (zenithdb/console#227) 2021-10-27 20:24:23 +03:00
Dmitry Rodionov
1c0e85f9a0 review cleanups 2021-10-27 13:30:34 +03:00
Dmitry Rodionov
5bc09074ea add a flag to avoid non incremental size calculation in pageserver http api
This calculation is not that heavy but it is needed only in tests, and
in case the number of tenants/timelines is high the calculation can take
noticeable time.

Resolves https://github.com/zenithdb/zenith/issues/804
2021-10-27 13:30:34 +03:00
Heikki Linnakangas
1fac4a3c91 Fix a few messages.
Pointed out by Egor in https://github.com/zenithdb/zenith/pull/788,
but I accidentally pushed that before fixing these.
2021-10-27 10:58:21 +03:00
Heikki Linnakangas
1bc917324d Use -m immediate for 'immediate' shutdown 2021-10-27 10:49:38 +03:00
Heikki Linnakangas
af429fb401 Improve 'zenith' CLI utility for safekeepers and a config file.
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.

The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.

Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.

The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers.  Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.

The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.

Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.
2021-10-27 10:49:38 +03:00
Heikki Linnakangas
710fe02d0b Return success on 'zenith stop' if the page server is already stopped. 2021-10-27 01:10:24 +03:00
Heikki Linnakangas
de87aad990 Remove a few unused functions 2021-10-27 01:10:24 +03:00
Heikki Linnakangas
41d48719e1 In python tests, skip ports that are already in use.
We've seen some failures with "Address already in use" errors in the
tests. It's not clear why, perhaps some server processes are not cleaned
up properly after test, or maybe the socket is still in TIME_WAIT state.
In any case, let's make the tests more robust by checking that the port
is free, before trying to use it.
2021-10-27 00:46:24 +03:00
Kirill Bulatov
d88377f9f0 Remove log from zenith_utils 2021-10-26 23:24:11 +03:00
Kirill Bulatov
ecd577c934 Simplify tracing declarations 2021-10-26 23:24:11 +03:00
anastasia
f43f8401ee Don't wait for wal-redo process for non-relational records replay 2021-10-26 19:30:28 +03:00
Arseny Sher
1877bbc7cb bump vendor/postgres to fix reconnection busy loop 2021-10-26 15:43:19 +03:00
Heikki Linnakangas
a064ebb64c Cope with missing 'tenantid' in '.zenith/config' file.
We generate the initial tenantid and store it in the file, so it shouldn't
be missing. But let's cope with it. (This comes handy with the bigger
changes I'm working on at https://github.com/zenithdb/zenith/pull/788)
2021-10-25 21:24:11 +03:00
Heikki Linnakangas
4726870e8d Remove obsolete comment.
We store the pageserver port in the .zenith/config file.
2021-10-25 21:16:58 +03:00
Heikki Linnakangas
3bbc106c70 Prefer long CLI option name for clarity. 2021-10-25 21:16:58 +03:00
Heikki Linnakangas
66eb081876 Improve comment on 'base_dir' 2021-10-25 21:16:58 +03:00
Kirill Bulatov
f291ab2b87 Do not panic on missing tenant 2021-10-25 18:36:30 +03:00
Heikki Linnakangas
66ec135676 Refactor pytest fixtures
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication, and no safekeepers. Tests that
want to use safekeepers or authentication set up a custom test-specific
ZenithEnv fixture.

Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
2021-10-25 14:14:47 +03:00
Heikki Linnakangas
28af3e5008 Remove some unnecessary fixture arguments 2021-10-25 14:14:45 +03:00
Heikki Linnakangas
f337d73a6c Rearrange output dirs a bit
Each test now gets its own test output directory, like
'test_output/test_foobar', even when TEST_SHARED_FIXTURES is used.
When TEST_SHARED_FIXTURES is not used, the zenith repo for each test
is created under a 'repo' subdir inside the test output dir, e.g.
'test_output/test_foobar/repo'
2021-10-25 14:14:43 +03:00
Heikki Linnakangas
57ce541521 Remove unnecessary 'pg_bin' object from 'postgres' fixture.
It was only used in check_restored_datadir_content(), and that function
can construct it easily from the other information it has.
2021-10-25 14:14:41 +03:00
Heikki Linnakangas
e14f24034f Turn a few path-fixtures to global variables
This way, they're readily accessible from the classes and functions
that are not themselves fixtures
2021-10-25 14:14:38 +03:00
Kirill Bulatov
04fb0a0342 Add core relish backup and restore functionality 2021-10-22 22:22:38 +03:00
Heikki Linnakangas
8c42dcc041 Fix safekeeper -D option.
The -D option to specify working directory was broken:

    $ mkdir foobar
    $ ./target/debug/safekeeper -D foobar
    Error: failed to open "foobar/safekeeper.log"

    Caused by:
        No such file or directory (os error 2)

This was because we both chdir'd into to specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekepeer.log"
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".

We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
2021-10-22 08:39:58 +03:00
Alexey Kondratov
9070a4dc02 Turn off back pressure by default 2021-10-22 01:40:43 +03:00
Egor Suvorov
86a28458c6 test_runner: use Python 3.7 in CI and improve its support (#775)
* We actually need Python 3.7 because of dataclasses
* Rerun 'pipenv lock' under Python 3.7 and add 'pipenv' to dev deps
* Update docs on developing for Python 3.7
* CircleCI: use Python 3.7 via Docker image instead of Orb
2021-10-21 20:01:29 +03:00
Egor Suvorov
c058d04250 Rename WalAcceptor to Safekeeper in most places (#741) 2021-10-21 18:26:43 +03:00
Konstantin Knizhnik
c310932121 Implement backpressure for compute node to avoid WAL overflow
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2021-10-21 18:15:50 +03:00
Egor Suvorov
ff563ff080 test_runner: fix mypy errors and force it on CI (#774)
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
2021-10-21 13:51:54 +03:00
anastasia
7f9d2a7d05 Change 'zenith tenant list' API to return tenant state added in 0dc7a3fc 2021-10-21 11:04:22 +03:00
Arthur Petukhovsky
13f4e173c9 Wait for safekeepers to catch up in test_restarts_under_load (#776) 2021-10-20 14:42:53 +03:00
Dmitry Ivanov
85116a8375 [proxy] Prevent TLS stream from hanging
This change causes writer halves of a TLS stream to always flush after a
portion of bytes has been written by `std::io::copy`. Furthermore, some
cosmetic and minor functional changes are made to facilitate debug.
2021-10-20 14:15:49 +03:00
Egor Suvorov
e42c884c2b test_runner/README: add note on capturing logs (#778)
Became actual after #674
2021-10-20 01:55:49 +03:00
Egor Suvorov
eb706bc9f4 Force yapf (Python code formatter) in CI (#772)
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
2021-10-19 20:13:47 +03:00
Dmitry Rodionov
798df756de suppress FileNotFound exception instead of missing_ok=True because the latter is added in python 3.8 and we claim to support >3.6 2021-10-19 17:13:42 +03:00
Dmitry Rodionov
732d13fe06 use cached-property package because python<3.8 doesnt have cached_property in functools 2021-10-19 17:13:42 +03:00
Heikki Linnakangas
feae7f39c1 Support read-only nodes
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').

The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.

If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.

We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
2021-10-19 09:48:12 +03:00
Heikki Linnakangas
c2b468c958 Separate node name from the branch name in ComputeControlPlane
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same brach, so we need an identifier
for each node, separate from the branch name.
2021-10-19 09:48:10 +03:00
Heikki Linnakangas
e272a380b4 On new repo, start writing WAL only after the initial checkpoint record.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighen up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
2021-10-19 09:48:04 +03:00
anastasia
0dc7a3fc15 Change tenant_mgr to use TenantState.
It allows to avoid locking entire TENANTS list while one tenant is bootstrapping
and prepares the code for remote storage integration.
2021-10-18 15:40:06 +03:00
Egor Suvorov
a1bc0ada59 Dockerfile: remove wal_acceptor alias for safekeeper (#743) 2021-10-18 14:56:30 +03:00
Kirill Bulatov
e9b5224a8a Fix toml serde gotchas 2021-10-18 14:14:27 +03:00
Heikki Linnakangas
bdd039a9ee S3 DELETE call returns 204, not 200.
According to the S3 API docs, the DELETE call returns code "204 No content"
on success.
2021-10-17 16:21:58 +03:00
Heikki Linnakangas
b405eef324 Avoid writing the metadata file when it hasn't changed. 2021-10-17 14:54:39 +03:00
Kirill Bulatov
ba557d126b React on sigint 2021-10-15 21:24:24 +03:00
Patrick Insinger
2dde20a227 Bump MSRV to 1.55 2021-10-15 09:10:08 -07:00
Kirill Bulatov
4ade0bb41c Refactor upload/download_relish function signatures.
This makes them more generic, by taking any Read / Write trait
implementation, instead of operating directly on a a file.
2021-10-15 11:34:15 +03:00
Stas Kelvich
100da024b6 expose pageserver http socket in docker 2021-10-15 00:26:38 +03:00
Arseny Sher
de744a44dd Add /timeline http request to safekeeper returning its status.
Which is mainly generational state (terms) and useful LSNs.

Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes #726.

ref #115
2021-10-14 19:02:38 +03:00
Heikki Linnakangas
0e026371ec Optimize WAL decoding slightly.
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
2021-10-14 14:21:23 +03:00
Arthur Petukhovsky
4b87acb1f6 Use logging in python tests (#674)
* Use logging in python tests

* Use f-strings for logs

* Don't log test output while running

* Use only pytest logging handler

* Add more info about pytest logging
2021-10-14 13:10:09 +03:00
Dmitry Ivanov
43957f4401 [cross-repo-ci] Use solely commit hash to test PRs in CI
See #744 for the discussion.
2021-10-13 17:16:02 +03:00
Heikki Linnakangas
8a4f092e82 Skip syncing the temp initdb installation.
Doesn't make much difference on my laptop with SSD, but every little
helps, and with a slower disk it might be noticeable.
2021-10-13 16:59:00 +03:00
Egor Suvorov
6b6b3f68be Safekeeper metrics refactor (#747) 2021-10-13 16:28:24 +03:00
Arseny Sher
96f1175a80 Cleanup hardcoded oids. 2021-10-13 10:52:47 +03:00
Patrick Insinger
1c29de81de pageserver - remove lsn from WALRecord 2021-10-13 00:03:42 -07:00
Egor Suvorov
f658263543 Revert "Dockerfile: remove wal_acceptor alias for safekeeper"
This reverts commit 64ca947722.
2021-10-12 19:05:58 +00:00
Egor Suvorov
64ca947722 Dockerfile: remove wal_acceptor alias for safekeeper 2021-10-12 19:05:16 +00:00
Egor Suvorov
23f4c0a742 Rename wal_acceptor binary to safekeeper (#740), stage 1/2
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
2021-10-12 22:03:06 +03:00
Dmitry Ivanov
7c5b99683c Speed up builds by passing make jobserver to cargo
This change brings the following improvements to our build system:

* Now BUILD_TYPE also affects rust apps.
* From now on, cargo will respect `-jN` passed via `make`. However, note
  that `rustc` may spawn multiple threads depending on compile flags.
* Cargo is able to cooperate with make to better schedule parallel jobs,
  which leads to better build times (-20s in release mode on my machine).
2021-10-12 21:02:39 +03:00
Patrick Insinger
160c4aff61 pageserver - use write guard for checkpointing 2021-10-12 10:02:15 -07:00
Patrick Insinger
6e5ca5dc5c pageserver - create TimelineWriter 2021-10-12 10:02:15 -07:00
Egor Suvorov
f3445949d1 Wal acceptor: report socket bind errors better when daemonizing (#738)
Fixes #664
2021-10-12 16:51:28 +03:00
Heikki Linnakangas
95a85312f5 Simplify code to build walredo messages.
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
2021-10-12 10:16:26 +03:00
Heikki Linnakangas
934fb8592f Detect when a checkpoint is modified in a smarter way.
Previously, the WAL receiver we would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record has been processed. If it has changed, the checkpoint
relish is updated in the repository. That's somewhat expensive, the
Checkpoint::encode() function is visible in 'perf' profile. Change that
so that we set a flag whenever the Checkpoint struct is modified, so that
we dont need to compare the whole struct anymore.
2021-10-12 09:09:10 +03:00
Dmitry Ivanov
bb239b4f69 [Makefile] Set default build type to debug 2021-10-11 17:08:31 +03:00
Dmitry Ivanov
1cd7900790 [Makefile] Make build type detection more precise
Previously, typos like `BUILD_TYPE=rlease` would silently
lead to building debug binaries. The current approach is also
more future-proof, since we might add `profile`, `valgrind`
as well as other build types.
2021-10-11 17:03:51 +03:00
Arseny Sher
8c61c3e54e Minor safekeeper readme fix. 2021-10-11 16:31:44 +03:00
anastasia
d7c9dd06f4 Implement graceful shutdown at 'pageserver stop':
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.

Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
2021-10-11 13:35:01 +03:00
Heikki Linnakangas
b9119f11bf Add perf test case for buffering GiST build.
When a WAL record affects multiple pages, we currently duplicate the
record for each affected page. That's a bit wasteful, but not too bad
for b-tree splits and non-hot heap updates that affect two pages. But
buffering GiST index build WAL-logs the whole relation in 32 page chunks,
with one giant WAL record for each 32-page chunk. Currently we duplicate
that giant record for each of the 32 pages, which is really wasteful.

Github issue https://github.com/zenithdb/zenith/issues/720 tracks the
problem. This commit adds a test case for it to demonstrate it.
2021-10-11 11:10:58 +03:00
Heikki Linnakangas
7216f22609 Use tracing crate to have more context in log messages.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.

This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing.  For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
2021-10-11 08:59:06 +03:00
Kirill Bulatov
bf58f7f649 Expose certain layered repository structs to reuse in relish storage (#688) 2021-10-09 19:23:57 +03:00
Patrick Insinger
3f0ebc6a40 pageserver - move early File::open call 2021-10-09 08:45:52 -07:00
Patrick Insinger
0baf4bc796 fix cargo doc complaints 2021-10-09 08:45:46 -07:00
Patrick Insinger
c356030660 pageserver - use VecMap for delta metadata & sizes 2021-10-08 15:05:22 -07:00
Patrick Insinger
c4bb6d78d4 pageserver - use VecMap for in memory segsizes 2021-10-08 14:37:32 -07:00
Patrick Insinger
3b82e806f2 pageserver - use VecMap for in-memory PageVersions 2021-10-08 14:11:07 -07:00
Egor Suvorov
403d9779d9 safekeeper: add initial metrics and HTTP handler (#699, #541)
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
2021-10-08 18:55:41 +03:00
Patrick Insinger
b3b8f18f61 tests - fix get_timeline_size signature 2021-10-07 15:38:22 -07:00
Heikki Linnakangas
960c7d69a8 Remove 'predecessor' reference from in-memory and delta layers.
The caller is now responsible for lookin up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.

There was a bug in that, as Konstantin noted on discord:

    Assume that freeze doesn't create new inmem layer
    (maybe_new_open=None). Then we temporary place in historics frozen
    layer. Assume that now new put_wal_record request arrives. There is
    no open in-mem layer, so it has to create new one. It is looking for
    previous layer for read and set it as new in-mem layer
    predecessor. But as far as I understand, prev layer should be our
    temporary frozen layer. Which will be then removed from
    historics.

That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.

This makes two subtle changes:

1. When the first new layer is created on a branch for a segment that
   existed on the ancestor branch, the start_lsn of the new layer is now
   the branch point + 1. We were previously slightly confused on what
   the branch point LSN meant. It means that all the WAL up to and
   *including* the LSN on the old branch is visible to the new branch.
   If we mark the start LSN of the new layer as equal to the branch point,
   that's wrong, because if there is a WAL record with that LSN on the
   predecessor layer, the new layer would hide it. This bug was hidden
   when the layer on the new branch contained a direct reference to the
   layer in the old branch, as get_page_reconstruct_data() followed that
   reference directly when it didn't find the page version in the new
   layer. But now that the caller performs the lookup, it will look up
   the new layer that doesn't contain the record, and you get an error.

2. InMemoryLayer now always stores the segment size at the beginning
   of the layer's LSN range. Previously, get_seg_size() might have
   recursed into the predecessor layer to get the size, but now we
   avoid that by always copying over the last size from the previous
   layer, when a new layer is created.
2021-10-08 00:54:13 +03:00
Heikki Linnakangas
60dae0b4ac Add test case that demonstrates Write Amplification. 2021-10-08 00:34:29 +03:00
Heikki Linnakangas
c660926a06 Refactor duplicated code to get on-disk timeline size in tests.
Move it to a common function. In the passing, remove the obsolete check
to exclude the 'wal' directory. The 'wal' directory is no more.
2021-10-08 00:34:26 +03:00
Egor Suvorov
7fa04e2d14 zenith_metrics: exit process on config errors (#706) 2021-10-08 00:14:56 +03:00
Heikki Linnakangas
db4059cd6d Measure peak memory usage in perf test.
Another useful metric to keep an eye on.
2021-10-07 18:03:20 +03:00
Heikki Linnakangas
fdb19fdb92 Remove unused function.
The caller was removed in commit acc0f41985.
2021-10-07 11:24:27 +03:00
Heikki Linnakangas
53b4dc944d Don't create unused "wal" directory
It hasn't been used since commit ca9af37478.
2021-10-07 10:36:26 +03:00
MMeent
a03e1b3895 Docker build now also uses BUILD_TYPE=release. (#712)
The dockerignore and dockerfile have also been excluded from being moved into
docker images, saving docker layer cache busts if only those are changed.
2021-10-06 23:42:00 +02:00
Heikki Linnakangas
15f1bcc9c2 Remove obsolete code, now that we don't load WAL from local disk anymore.
Commit ca9af37478 removed the import_timeline_wal() call from here.
After that, the info!() message is bogus, as we no longer load the WAL
from local disk. Also, the logical size assertion is pointless now.
2021-10-06 15:59:28 +03:00
MMeent
24580f2493 Improve build system: (#703)
- Build postgresql with -O2 for releases
 - Make make make postgresql with 8 parallel threads
   The node is xlarge, so it has 8 vCPU available
2021-10-06 14:37:27 +02:00
Heikki Linnakangas
e3945d94fd Store unlogged tables locally, and replace PD_WAL_LOGGED.
All the changes are in the vendor/postgres side. However, because we now
generate fewer Full Page Writes, the 'branch_behind' test needs to be
modified so that it still generates enough WAL to consume a few WAL
segments.
2021-10-06 10:58:15 +03:00
Heikki Linnakangas
d806c3a47e pageserver - serialize PageVersion as it is
Removes the need for PageVersionMeta struct.
2021-10-05 11:07:50 -07:00
Egor Suvorov
05fe39088b Readme updates based on a fresher Ubuntu installation experience (#627) 2021-10-05 19:19:25 +03:00
Egor Suvorov
530d3eaf09 Add more details to pageserver and safekeeper docs (#680) 2021-10-05 19:10:50 +03:00
Egor Suvorov
7e190d72a5 Make pageserver_ prefix for common metric names configurable (#681) 2021-10-05 19:06:44 +03:00
Patrick Insinger
9c936034b6 pageserver - fix newer clippy lints 2021-10-05 00:28:14 -07:00
Kirill Bulatov
5719f13cb2 Rework the relish thread model (#689) 2021-10-05 10:15:56 +03:00
Patrick Insinger
d134a9856e pageserver - introduce RepoHarness for testing 2021-10-04 08:36:35 -07:00
Patrick Insinger
664b99b5ac pageserver - use constant TIMELINE_ID for tests 2021-10-04 08:36:35 -07:00
Arseny Sher
4256231eb7 Enable test_start_compute with safekeepers.
It should work now.
2021-10-04 16:50:46 +03:00
Andrey Taranik
ae27490281 wal_acceptors added to tenant creation tests 2021-10-04 08:58:49 +03:00
Andrey Taranik
fbd8ca2ff4 minor code beautification 2021-10-04 08:58:49 +03:00
Andrey Taranik
ec673a5d67 bulk tenant create test added 2021-10-04 08:58:49 +03:00
Max Sharnoff
7fab38c51e Use threadlocal for walreceiver check (#692) 2021-10-01 15:47:45 -07:00
Max Sharnoff
84f7dcd052 Fix clippy errors on nightly (2021-09-29) (#691)
Most of the changes are for the new if-then-panic lint added in
https://github.com/rust-lang/rust-clippy/pull/7669.
2021-10-01 15:45:42 -07:00
Patrick Insinger
7095a5d551 pageserver - reject and backup future layer files
If a layer file is found with LSN after the disk_consistent_lsn, it is
renamed (to avoid conflicts with new layer files) and a warning is logged.
2021-10-01 11:41:39 -07:00
Patrick Insinger
538c2a2a3e pageserver - store timeline metadata durably
The metadata file is now always 512 bytes. The last 4 bytes are a
crc32c checksum of the previous 508 bytes. Padding zeroes are added
between the serde serialization and the start of the checksum.

A single write call is used, and the file is fsyncd after.
On file creation, the parent directory is fsyncd as well.
2021-10-01 11:41:39 -07:00
Patrick Insinger
62f83869f1 pageserver - fsync image/delta layers
Ensure image and delta layer files are durable.
Also, fsync the parent directory to ensure the directory entries are
durable.
2021-10-01 11:41:39 -07:00
Patrick Insinger
69670b61c4 pageserver - use crashsafe_dir utility
Replace usage of std::fs::create_dir/create_dir_all with crashsafe
equivalents.
2021-10-01 11:41:39 -07:00
Patrick Insinger
0a8aaa2c24 zenith_utils - add crashsafe_dir
Utility for creating directories and directory trees in a crash safe
manor.

Minimizes calls to fsync for trees.
2021-10-01 11:41:39 -07:00
Heikki Linnakangas
e474790400 Print more details on errors to log
Fixes https://github.com/zenithdb/zenith/issues/661
2021-10-01 17:57:41 +03:00
Alexey Kondratov
2c99e2461a Allow usage of the compute hostname in the proxy 2021-10-01 16:24:35 +03:00
Stas Kelvich
cf8e27a554 Proxy: pass database name in console too 2021-10-01 14:27:52 +03:00
Kirill Bulatov
287ea2e5e3 Limit concurrent relish storage sync operations 2021-10-01 08:37:09 +03:00
Heikki Linnakangas
86e14f2f1a Bump vendor/postgres 2021-09-30 20:36:57 +03:00
Arseny Sher
adbae62281 Rename SharedState.commit_lsn to notified_commit_lsn.
ref #682
2021-09-30 17:29:15 +03:00
Egor Suvorov
3127a4a13b Safekeeper::Storage::write_wal: clarify behavior (#679)
It previously took &SafeKeeperState similar to persist(), but only for its
`server` member.
Now it takes &ServerInfo only, so there it's clear the state is not persisted.
Also added a comment about sync.
2021-09-29 19:58:30 +03:00
Egor Suvorov
6d993410c9 docs/README: fix link to walkeeper's README (#677) 2021-09-29 14:40:16 +03:00
Kirill Bulatov
fb05e4cb0b Show better error messages on pageserver failures 2021-09-29 01:55:41 +03:00
Egor Suvorov
b0a7234759 pageserver: fix stale default listen addrs
* In command line help
* In dummy_conf
2021-09-28 20:57:51 +03:00
Egor Suvorov
ddf4b15ebc pageserver: use const_format crate to generate default listen addrs 2021-09-28 20:57:51 +03:00
Egor Suvorov
3065532f15 pageserver: fix mistype in listen-http arg help 2021-09-28 20:57:51 +03:00
Arthur Petukhovsky
d6fc74a412 Various fixes for test_sync_safekeepers (#668)
* Send ProposerGreeting manually in tests

* Move test_sync_safekeepers to test_wal_acceptor.py

* Capture test_sync_safekeepers output

* Add comment for handle_json_ctrl

* Save captured output in CI
2021-09-28 19:25:05 +03:00
Arseny Sher
7a370394a7 Wait till previous victim recovers in run_restarts_under_load.
Fixes test flakiness, as recovery easily might take the whole iteration.
2021-09-28 19:15:41 +03:00
Stas Kelvich
0f3cf8ac94 Cleanup Dockerfile.
* make .dockerignore `ncdu -X` compatible to easily inspect build context
* remove cargo-chef as it was introducing more problems than it was solving
* remove rocksdb packages
* add ca-certs in the resulting image. We need that to be able to make https
  connections from container with proxy to the console.
2021-09-28 18:26:20 +03:00
Heikki Linnakangas
014be8b230 Use Iterator, to avoid making one copy of page_versions BTreeMap
Reduces the CPU time spent in checkpointing, in the write_to_disk()
function.
2021-09-27 19:28:02 +03:00
Heikki Linnakangas
08978458be Refactor write_to_disk, handling dropped segment as a special case.
Similar to what commit 7fb7f67b did to 'freeze', dealing with the
dropped segment separately from the rest of the logic makes the code
easier to follow. It is also needed by the next commit that replaces
the code to build new BTreeMap with an iterator; we cannot pass one
of two kinds of closures as argument, it has to always be the same one.
Having separate DeltaLayer::create() calls for the case of dropped
segment and the other cases works around that.
2021-09-27 19:23:32 +03:00
Heikki Linnakangas
2252d9faa8 Switch to RwLock in InMemoryLayer
Allows more parallelism basically for free.
2021-09-27 19:15:40 +03:00
Arthur Petukhovsky
22e15844ae Fix clippy errors (#673) 2021-09-27 18:59:30 +03:00
Konstantin Knizhnik
ca9af37478 Do not write WAL at pageserver (#645)
* Do not write WAL at pageserver

* Remove import_timeline_wal function
2021-09-27 14:15:55 +03:00
Stas Kelvich
aae41e8661 Proxy pass for existing users.
Ask console to check per-cluster auth info.
2021-09-27 11:56:43 +03:00
Stas Kelvich
8331ce865c Interceipt and log error in mgmt interface.
That PostgresBackend is better be replaced with the http server or redis
subscription. For now let's improve logging and move on.
2021-09-27 11:56:43 +03:00
Stas Kelvich
3bac4d485d Fix EncryptionResponse message in pq_proto.rs
Positive EncryptionResponse should set 'S' byte, not 'Y'. With that
fix it is possible to connect to proxy with SSL enabled and read
deciphered notice text. But after the first query everything stucks.
2021-09-27 11:56:43 +03:00
Stas Kelvich
f84eaf4f05 Leave only pkcs8 keys support for proxy.
rsa_private_keys() function returns an empty vector when tries to read
pkcs8-encoded file instead of returning an error. So previous check was
failing on pkcs8. Leave only pkcs8 for now.
2021-09-27 11:56:43 +03:00
Arseny Sher
70b08923ed Disable new safekeepers tests as not stable enough. 2021-09-26 22:33:58 +03:00
Heikki Linnakangas
c846a824de Bump vendor/postgres, to use buffered I/O in WAL redo process.
Greatly reduces the CPU overhead in the WAL redo process.
2021-09-24 21:48:30 +03:00
Heikki Linnakangas
b71e3a40e2 Add more details to the log, when an error happens in GetPage request. 2021-09-24 21:44:22 +03:00
Heikki Linnakangas
41dfc117e7 Buffer the writes to the WAL redo process pipe.
Reduces the CPU time spent in the write() syscalls. I noticed that we were
spending a lot of CPU time in libc::write, coming from request_redo(), in
the 'bulk_insert' test. According to some quick profiling with 'perf',
this reduces the CPU time spent in request_redo() from about 30% to 15%.

For some reason, it doesn't reduce the overall runtime of the 'bulk_insert'
test much, maybe by one second if you squint (from about 37s to 36s), so
there must be some other bottleneck, like I/O. But this is surely still
a good idea, just based on the reduced CPU cycles.
2021-09-24 21:12:38 +03:00
sharnoff
a72707b8cb Redo #655 with fix: Allow LeSer/BeSer impls missing either Serialize or Deserialize
Commit message copied below:

* Allow LeSer/BeSer impls missing Serialize/Deserialize

Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.

Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.

* Remove unused #[derive(Serialize/Deserialize)]

This should hopefully reduce compile times - if only by a little bit.

Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
2021-09-24 10:58:01 -07:00
Max Sharnoff
0f770967b4 Revert "Allow LeSer/BeSer impls missing either Serialize or Deserialize (#655)
This reverts commit bd9f4794d9.
2021-09-24 10:18:36 -07:00
Max Sharnoff
bd9f4794d9 Allow LeSer/BeSer impls missing either Serialize or Deserialize (#655)
* Allow LeSer/BeSer impls missing Serialize/Deserialize

Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.

Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.

* Remove unused #[derive(Serialize/Deserialize)]

This should hopefully reduce compile times - if only by a little bit.

Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most are have *become* unused with the change to
LeSer/BeSer.
2021-09-24 10:06:03 -07:00
Heikki Linnakangas
ff5cbe2694 Support overlapping and nested Layers in the layer map.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.

While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.

So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.

Fixes https://github.com/zenithdb/zenith/issues/517.
2021-09-24 14:10:52 +03:00
Heikki Linnakangas
2319e0ec8f Define a layer's start and end bounds more precisely.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.

For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).

This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
2021-09-24 14:10:49 +03:00
Arthur Petukhovsky
d4e037f1e7 Support for --sync-safekeepers in tests (#647)
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.

Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
2021-09-24 13:19:59 +03:00
Max Sharnoff
139936197a bump vendor/postgres: Catch walkeeper ErrorResponse (#650)
Postgres commit message:

PQgetCopyData can sometimes indicate that the copy is done if the
backend returns an error response. So while we still expect that the
walkeeper never sends CopyDone, we can't expect it to never produce
errors.
2021-09-23 14:55:38 -07:00
Heikki Linnakangas
d4eed61f57 Refactor code for parsing and creating postgresql.conf.
There's surely more that could be done, but this makes it a bit more
readable at least.
2021-09-23 19:34:27 +03:00
Patrick Insinger
7db3a9e7d9 walredo - don't use RefCell on stdin/stdout 2021-09-23 08:42:58 -07:00
Patrick Insinger
c81ee3bd5b Add some comments to the checkpoint process 2021-09-23 13:19:45 +03:00
anastasia
7fb7f67bb4 Fix relish extention after it was dropped or truncated.
- Turn dropped layers into non-writeable in get_layer_for_write().

- Handle non-writeable dropped layers in checkpointer. They don't need freezing, so just remove them from list of open_segs and write out to disk.

- Remove code that handles dropped layers in freeze() function. It is not used anymore.
2021-09-23 13:19:45 +03:00
anastasia
86164c8b33 Add unit tests for drop_lsn.
test_drop_extend and test_truncate_extend illustrate what happens if we dropped a segment and then created it again within the same layer.
2021-09-23 13:19:45 +03:00
Arseny Sher
97c4cd4434 bump vendor/postgres 2021-09-23 12:22:53 +03:00
anastasia
a4fc6da57b Fix gc_internal to treat dropped layers.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones
2021-09-23 12:21:47 +03:00
anastasia
c934e724a8 Enable test_list_rels_drop test 2021-09-23 12:21:47 +03:00
anastasia
e554f9514f gc refactoring
- rename 'compact' argument of GC to 'checkpoint_before_gc'.
- gc_iteration_internal() refactoring
2021-09-23 12:21:47 +03:00
247 changed files with 35862 additions and 10386 deletions

View File

@@ -1,29 +1,34 @@
version: 2.1
orbs:
python: circleci/python@1.4.0
executors:
zenith-build-executor:
zenith-xlarge-executor:
resource_class: xlarge
docker:
- image: cimg/rust:1.52.1
# NB: when changed, do not forget to update rust image tag in all Dockerfiles
- image: zimg/rust:1.56
zenith-executor:
docker:
- image: zimg/rust:1.56
jobs:
check-codestyle:
executor: zenith-build-executor
check-codestyle-rust:
executor: zenith-xlarge-executor
steps:
- checkout
- run:
name: rustfmt
when: always
command: |
cargo fmt --all -- --check
command: cargo fmt --all -- --check
# A job to build postgres
build-postgres:
executor: zenith-build-executor
executor: zenith-xlarge-executor
parameters:
build_type:
type: enum
enum: ["debug", "release"]
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
# Checkout the git repo (circleci doesn't have a flag to enable submodules here)
- checkout
@@ -32,23 +37,13 @@ jobs:
# Note this works even though the submodule hasn't been checkout out yet.
- run:
name: Get postgres cache key
command: |
git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
command: git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
- restore_cache:
name: Restore postgres cache
keys:
# Restore ONLY if the rev key matches exactly
- v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
# FIXME We could cache our own docker container, instead of installing packages every time.
- run:
name: apt install dependencies
command: |
if [ ! -e tmp_install/bin/postgres ]; then
sudo apt update
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev
fi
- v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
# Build postgres if the restore_cache didn't find a build.
# `make` can't figure out whether the cache is valid, since
@@ -59,29 +54,26 @@ jobs:
if [ ! -e tmp_install/bin/postgres ]; then
# "depth 1" saves some time by not cloning the whole repo
git submodule update --init --depth 1
make postgres
# bail out on any warnings
COPT='-Werror' mold -run make postgres -j$(nproc)
fi
- save_cache:
name: Save postgres cache
key: v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
key: v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
paths:
- tmp_install
# A job to build zenith rust code
build-zenith:
executor: zenith-build-executor
executor: zenith-xlarge-executor
parameters:
build_type:
type: enum
enum: ["debug", "release"]
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
- run:
name: apt install dependencies
command: |
sudo apt update
sudo apt install libssl-dev clang
# Checkout the git repo (without submodules)
- checkout
@@ -96,7 +88,7 @@ jobs:
name: Restore postgres cache
keys:
# Restore ONLY if the rev key matches exactly
- v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
- v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
- restore_cache:
name: Restore rust cache
@@ -104,25 +96,26 @@ jobs:
# Require an exact match. While an out of date cache might speed up the build,
# there's no way to clean out old packages, so the cache grows every time something
# changes.
- v03-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
- v04-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
# Build the rust code, including test binaries
- run:
name: Rust build << parameters.build_type >>
command: |
export CARGO_INCREMENTAL=0
BUILD_TYPE="<< parameters.build_type >>"
if [[ $BUILD_TYPE == "debug" ]]; then
echo "Build in debug mode"
cargo build --bins --tests
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
echo "Build in release mode"
cargo build --release --bins --tests
cov_prefix=()
CARGO_FLAGS=--release
fi
export CARGO_INCREMENTAL=0
"${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests
- save_cache:
name: Save rust cache
key: v03-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
key: v04-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
paths:
- ~/.cargo/registry
- ~/.cargo/git
@@ -132,53 +125,115 @@ jobs:
# has to run separately from cargo fmt section
# since needs to run with dependencies
- run:
name: clippy
name: cargo clippy
command: |
./run_clippy.sh
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" ./run_clippy.sh
# Run rust unit tests
- run: cargo test
- run:
name: cargo test
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" cargo test
# Install the rust binaries, for use by test jobs
# `--locked` is required; otherwise, `cargo install` will ignore Cargo.lock.
# FIXME: this is a really silly way to install; maybe we should just output
# a tarball as an artifact? Or a .deb package?
- run:
name: cargo install
name: Install rust binaries
command: |
export CARGO_INCREMENTAL=0
BUILD_TYPE="<< parameters.build_type >>"
if [[ $BUILD_TYPE == "debug" ]]; then
echo "Install debug mode"
CARGO_FLAGS="--debug"
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
echo "Install release mode"
# The default is release mode; there is no --release flag.
CARGO_FLAGS=""
cov_prefix=()
fi
binaries=$(
"${cov_prefix[@]}" cargo metadata --format-version=1 --no-deps |
jq -r '.packages[].targets[] | select(.kind | index("bin")) | .name'
)
test_exe_paths=$(
"${cov_prefix[@]}" cargo test --message-format=json --no-run |
jq -r '.executable | select(. != null)'
)
mkdir -p /tmp/zenith/bin
mkdir -p /tmp/zenith/test_bin
mkdir -p /tmp/zenith/etc
# Install target binaries
for bin in $binaries; do
SRC=target/$BUILD_TYPE/$bin
DST=/tmp/zenith/bin/$bin
cp $SRC $DST
echo $DST >> /tmp/zenith/etc/binaries.list
done
# Install test executables (for code coverage)
if [[ $BUILD_TYPE == "debug" ]]; then
for bin in $test_exe_paths; do
SRC=$bin
DST=/tmp/zenith/test_bin/$(basename $bin)
cp $SRC $DST
echo $DST >> /tmp/zenith/etc/binaries.list
done
fi
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path pageserver
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path walkeeper
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path zenith
# Install the postgres binaries, for use by test jobs
# FIXME: this is a silly way to do "install"; maybe just output a standard
# postgres package, whatever the favored form is (tarball? .deb package?)
# Note that pg_regress needs some build artifacts that probably aren't
# in the usual package...?
- run:
name: postgres install
name: Install postgres binaries
command: |
cp -a tmp_install /tmp/zenith/pg_install
# Save the rust output binaries for other jobs in this workflow.
- run:
name: Merge coverage data
command: |
# This will speed up workspace uploads
if [[ $BUILD_TYPE == "debug" ]]; then
scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage merge
fi
# Save the rust binaries and coverage data for other jobs in this workflow.
- persist_to_workspace:
root: /tmp/zenith
paths:
- "*"
check-codestyle-python:
executor: zenith-executor
steps:
- checkout
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
name: Run yapf to ensure code format
when: always
command: poetry run yapf --recursive --diff .
- run:
name: Run mypy to check types
when: always
command: poetry run mypy .
run-pytest:
#description: "Run pytest"
executor: python/default
executor: zenith-executor
parameters:
# pytest args to specify the tests to run.
#
@@ -204,6 +259,11 @@ jobs:
run_in_parallel:
type: boolean
default: true
save_perf_report:
type: boolean
default: false
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
- attach_workspace:
at: /tmp/zenith
@@ -212,21 +272,35 @@ jobs:
condition: << parameters.needs_postgres_source >>
steps:
- run: git submodule update --init --depth 1
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install pipenv & deps
working_directory: test_runner
command: |
pip install pipenv
pipenv install
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
name: Run pytest
working_directory: test_runner
# pytest doesn't output test logs in real time, so CI job may fail with
# `Too long with no output` error, if a test is running for a long time.
# In that case, tests should have internal timeouts that are less than
# no_output_timeout, specified here.
no_output_timeout: 10m
environment:
- ZENITH_BIN: /tmp/zenith/bin
- POSTGRES_DISTRIB_DIR: /tmp/zenith/pg_install
- TEST_OUTPUT: /tmp/test_output
# this variable will be embedded in perf test report
# and is needed to distinguish different environments
- PLATFORM: zenith-local-ci
command: |
TEST_SELECTION="<< parameters.test_selection >>"
PERF_REPORT_DIR="$(realpath test_runner/perf-report-local)"
rm -rf $PERF_REPORT_DIR
TEST_SELECTION="test_runner/<< parameters.test_selection >>"
EXTRA_PARAMS="<< parameters.extra_params >>"
if [ -z "$TEST_SELECTION" ]; then
echo "test_selection must be set"
@@ -234,18 +308,46 @@ jobs:
fi
if << parameters.run_in_parallel >>; then
EXTRA_PARAMS="-n4 $EXTRA_PARAMS"
fi;
fi
if << parameters.save_perf_report >>; then
if [[ $CIRCLE_BRANCH == "main" ]]; then
mkdir -p "$PERF_REPORT_DIR"
EXTRA_PARAMS="--out-dir $PERF_REPORT_DIR $EXTRA_PARAMS"
fi
fi
export GITHUB_SHA=$CIRCLE_SHA1
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
# Run the tests.
#
# The junit.xml file allows CircleCI to display more fine-grained test information
# in its "Tests" tab in the results page.
# -s prevents pytest from capturing output, which helps to see
# what's going on if the test hangs
# --verbose prints name of each test (helpful when there are
# multiple tests in one file)
# -rA prints summary in the end
# -n4 uses four processes to run tests via pytest-xdist
pipenv run pytest --junitxml=$TEST_OUTPUT/junit.xml --tb=short -s --verbose -rA $TEST_SELECTION $EXTRA_PARAMS
# -s is not used to prevent pytest from capturing output, because tests are running
# in parallel and logs are mixed between different tests
"${cov_prefix[@]}" ./scripts/pytest \
--junitxml=$TEST_OUTPUT/junit.xml \
--tb=short \
--verbose \
-m "not remote_cluster" \
-rA $TEST_SELECTION $EXTRA_PARAMS
if << parameters.save_perf_report >>; then
if [[ $CIRCLE_BRANCH == "main" ]]; then
export REPORT_FROM="$PERF_REPORT_DIR"
export REPORT_TO=local
scripts/generate_and_push_perf_report.sh
fi
fi
- run:
# CircleCI artifacts are preserved one file at a time, so skipping
# this step isn't a good idea. If you want to extract the
@@ -254,13 +356,73 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "wal_acceptor.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" -delete
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
# The store_test_results step tells CircleCI where to find the junit.xml file.
- store_test_results:
path: /tmp/test_output
- run:
name: Merge coverage data
command: |
# This will speed up workspace uploads
if [[ $BUILD_TYPE == "debug" ]]; then
scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage merge
fi
# Save coverage data (if any)
- persist_to_workspace:
root: /tmp/zenith
paths:
- "*"
coverage-report:
executor: zenith-xlarge-executor
steps:
- attach_workspace:
at: /tmp/zenith
- checkout
- restore_cache:
name: Restore rust cache
keys:
# Require an exact match. While an out of date cache might speed up the build,
# there's no way to clean out old packages, so the cache grows every time something
# changes.
- v04-rust-cache-deps-debug-{{ checksum "Cargo.lock" }}
- run:
name: Build coverage report
command: |
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
scripts/coverage \
--dir=/tmp/zenith/coverage report \
--input-objects=/tmp/zenith/etc/binaries.list \
--commit-url=$COMMIT_URL \
--format=github
- run:
name: Upload coverage report
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
REPORT_URL=https://zenithdb.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
scripts/git-upload \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \
--message="Add code coverage for $COMMIT_URL" \
copy /tmp/zenith/coverage/report $CIRCLE_SHA1 # COPY FROM TO_RELATIVE
# Add link to the coverage report to the commit
curl -f -X POST \
https://api.github.com/repos/$LOCAL_REPO/statuses/$CIRCLE_SHA1 \
-H "Accept: application/vnd.github.v3+json" \
--user "$CI_ACCESS_TOKEN" \
--data \
"{
\"state\": \"success\",
\"context\": \"zenith-coverage\",
\"description\": \"Coverage report is ready\",
\"target_url\": \"$REPORT_URL\"
}"
# Build zenithdb/zenith:latest image and push it to Docker hub
docker-image:
@@ -277,7 +439,101 @@ jobs:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/zenith:latest . && docker push zenithdb/zenith:latest
DOCKER_TAG=$(git log --oneline|wc -l)
docker build --build-arg GIT_VERSION=$CIRCLE_SHA1 -t zenithdb/zenith:latest . && docker push zenithdb/zenith:latest
docker tag zenithdb/zenith:latest zenithdb/zenith:${DOCKER_TAG} && docker push zenithdb/zenith:${DOCKER_TAG}
# Build zenithdb/compute-node:latest image and push it to Docker hub
docker-image-compute:
docker:
- image: cimg/base:2021.04
steps:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build zenithdb/compute-tools:latest image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/compute-tools:latest -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:latest
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
- run:
name: Build and push compute-node Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build -t zenithdb/compute-node:latest vendor/postgres && docker push zenithdb/compute-node:latest
docker tag zenithdb/compute-node:latest zenithdb/compute-node:${DOCKER_TAG} && docker push zenithdb/compute-node:${DOCKER_TAG}
deploy-staging:
docker:
- image: cimg/python:3.10
steps:
- checkout
- setup_remote_docker
- run:
name: Get Zenith binaries
command: |
rm -rf zenith_install postgres_install.tar.gz zenith_install.tar.gz
mkdir zenith_install
DOCKER_TAG=$(git log --oneline|wc -l)
docker pull --quiet zenithdb/zenith:${DOCKER_TAG}
ID=$(docker create zenithdb/zenith:${DOCKER_TAG})
docker cp $ID:/data/postgres_install.tar.gz .
tar -xzf postgres_install.tar.gz -C zenith_install && rm postgres_install.tar.gz
docker cp $ID:/usr/local/bin/pageserver zenith_install/bin/
docker cp $ID:/usr/local/bin/safekeeper zenith_install/bin/
docker cp $ID:/usr/local/bin/proxy zenith_install/bin/
docker cp $ID:/usr/local/bin/postgres zenith_install/bin/
docker rm -v $ID
echo ${DOCKER_TAG} | tee zenith_install/.zenith_current_version
tar -czf zenith_install.tar.gz -C zenith_install .
ls -la zenith_install.tar.gz
- run:
name: Setup ansible
command: |
pip install --progress-bar off --user ansible boto3
ansible-galaxy collection install amazon.aws
- run:
name: Apply re-deploy playbook
environment:
ANSIBLE_HOST_KEY_CHECKING: false
command: |
echo "${STAGING_SSH_KEY}" | base64 --decode | ssh-add -
export AWS_REGION=${STAGING_AWS_REGION}
export AWS_ACCESS_KEY_ID=${STAGING_AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY=${STAGING_AWS_SECRET_ACCESS_KEY}
ansible-playbook .circleci/storage-redeploy.playbook.yml
rm -f zenith_install.tar.gz
deploy-staging-proxy:
docker:
- image: cimg/base:2021.04
environment:
KUBECONFIG: .kubeconfig
steps:
- checkout
- run:
name: Store kubeconfig file
command: |
echo "${STAGING_KUBECONFIG_DATA}" | base64 --decode > ${KUBECONFIG}
chmod 0600 ${KUBECONFIG}
- run:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://zenithdb.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
DOCKER_TAG=$(git log --oneline|wc -l)
helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/proxy.staging.yaml --set image.tag=${DOCKER_TAG} --wait
# Trigger a new remote CI job
remote-ci-trigger:
@@ -319,25 +575,30 @@ jobs:
\"inputs\": {
\"ci_job_name\": \"zenith-remote-ci\",
\"commit_hash\": \"$CIRCLE_SHA1\",
\"remote_repo\": \"$LOCAL_REPO\",
\"zenith_image_branch\": \"$CIRCLE_BRANCH\"
\"remote_repo\": \"$LOCAL_REPO\"
}
}"
workflows:
build_and_test:
jobs:
- check-codestyle
- build-postgres
- check-codestyle-rust
- check-codestyle-python
- build-postgres:
name: build-postgres-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
- build-zenith:
name: build-zenith-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
requires:
- build-postgres
- build-postgres-<< matrix.build_type >>
- run-pytest:
name: pg_regress-tests-<< matrix.build_type >>
context: PERF_TEST_RESULT_CONNSTR
matrix:
parameters:
build_type: ["debug", "release"]
@@ -355,11 +616,19 @@ workflows:
- build-zenith-<< matrix.build_type >>
- run-pytest:
name: benchmarks
context: PERF_TEST_RESULT_CONNSTR
build_type: release
test_selection: performance
run_in_parallel: false
save_perf_report: true
requires:
- build-zenith-release
- coverage-report:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
requires:
# TODO: consider adding more
- other-tests-debug
- docker-image:
# Context gives an ability to login
context: Docker Hub
@@ -371,6 +640,35 @@ workflows:
requires:
- pg_regress-tests-release
- other-tests-release
- docker-image-compute:
# Context gives an ability to login
context: Docker Hub
# Build image only for commits to main
filters:
branches:
only:
- main
requires:
- pg_regress-tests-release
- other-tests-release
- deploy-staging:
# Context gives an ability to login
context: Docker Hub
# deploy only for commits to main
filters:
branches:
only:
- main
requires:
- docker-image
- deploy-staging-proxy:
# deploy only for commits to main
filters:
branches:
only:
- main
requires:
- docker-image
- remote-ci-trigger:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN

View File

@@ -0,0 +1,27 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
settings:
authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/"
uri: "https://console.stage.zenith.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
zenith_service: proxy
zenith_env: staging
zenith_region: us-east-1
zenith_region_slug: virginia
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.stage.zenith.tech
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -0,0 +1,138 @@
- name: discover storage nodes
hosts: localhost
connection: local
gather_facts: False
tasks:
- name: discover safekeepers
no_log: true
ec2_instance_info:
filters:
"tag:zenith_env": "staging"
"tag:zenith_service": "safekeeper"
register: ec2_safekeepers
- name: discover pageservers
no_log: true
ec2_instance_info:
filters:
"tag:zenith_env": "staging"
"tag:zenith_service": "pageserver"
register: ec2_pageservers
- name: add safekeepers to host group
no_log: true
add_host:
name: safekeeper-{{ ansible_loop.index }}
ansible_host: "{{ item.public_ip_address }}"
groups:
- storage
- safekeepers
with_items: "{{ ec2_safekeepers.instances }}"
loop_control:
extended: yes
- name: add pageservers to host group
no_log: true
add_host:
name: pageserver-{{ ansible_loop.index }}
ansible_host: "{{ item.public_ip_address }}"
groups:
- storage
- pageservers
with_items: "{{ ec2_pageservers.instances }}"
loop_control:
extended: yes
- name: Retrive versions
hosts: storage
gather_facts: False
remote_user: admin
tasks:
- name: Get current version of binaries
set_fact:
current_version: "{{lookup('file', '../zenith_install/.zenith_current_version') }}"
- name: Check that file with version exists on host
stat:
path: /usr/local/.zenith_current_version
register: version_file
- name: Try to get current version from the host
when: version_file.stat.exists
ansible.builtin.fetch:
src: /usr/local/.zenith_current_version
dest: .remote_version.{{ inventory_hostname }}
fail_on_missing: no
flat: yes
- name: Store remote version to variable
when: version_file.stat.exists
set_fact:
remote_version: "{{ lookup('file', '.remote_version.{{ inventory_hostname }}') }}"
- name: Store default value of remote version to variable in case when remote version file not found
when: not version_file.stat.exists
set_fact:
remote_version: "000"
- name: Extract Zenith binaries
hosts: storage
gather_facts: False
remote_user: admin
tasks:
- name: Inform about version conflict
when: current_version <= remote_version
debug: msg="Current version {{ current_version }} LE than remote {{ remote_version }}"
- name: Extract Zenith binaries to /usr/local
when: current_version > remote_version
ansible.builtin.unarchive:
src: ../zenith_install.tar.gz
dest: /usr/local
become: true
- name: Restart safekeepers
hosts: safekeepers
gather_facts: False
remote_user: admin
tasks:
- name: Inform about version conflict
when: current_version <= remote_version
debug: msg="Current version {{ current_version }} LE than remote {{ remote_version }}"
- name: Restart systemd service
when: current_version > remote_version
ansible.builtin.systemd:
daemon_reload: yes
name: safekeeper
enabled: yes
state: restarted
become: true
- name: Restart pageservers
hosts: pageservers
gather_facts: False
remote_user: admin
tasks:
- name: Inform about version conflict
when: current_version <= remote_version
debug: msg="Current version {{ current_version }} LE than remote {{ remote_version }}"
- name: Restart systemd service
when: current_version > remote_version
ansible.builtin.systemd:
daemon_reload: yes
name: pageserver
enabled: yes
state: restarted
become: true

View File

@@ -2,12 +2,17 @@
**/__pycache__
**/.pytest_cache
/target
/tmp_check
/tmp_install
/tmp_check_cli
/test_output
/.vscode
/.zenith
/integration_tests/.zenith
/Dockerfile
.git
target
tmp_check
tmp_install
tmp_check_cli
test_output
.vscode
.zenith
integration_tests/.zenith
.mypy_cache
Dockerfile
.dockerignore

103
.github/workflows/benchmarking.yml vendored Normal file
View File

@@ -0,0 +1,103 @@
name: benchmarking
on:
# uncomment to run on push for debugging your PR
# push:
# branches: [ your branch ]
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '36 7 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
jobs:
bench:
# this workflow runs on self hosteed runner
# it's environment is quite different from usual guthub runner
# probably the most important difference is that it doesnt start from clean workspace each time
# e g if you install system packages they are not cleaned up since you install them directly in host machine
# not a container or something
# See documentation for more info: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners
runs-on: [self-hosted, zenith-benchmarker]
env:
PG_BIN: "/usr/pgsql-13/bin"
steps:
- name: Checkout zenith repo
uses: actions/checkout@v2
# actions/setup-python@v2 is not working correctly on self-hosted runners
# see https://github.com/actions/setup-python/issues/162
# and probably https://github.com/actions/setup-python/issues/162#issuecomment-865387976 in particular
# so the simplest solution to me is to use already installed system python and spin virtualenvs for job runs.
# there is Python 3.7.10 already installed on the machine so use it to install poetry and then use poetry's virtuealenvs
- name: Install poetry & deps
run: |
python3 -m pip install --upgrade poetry wheel
# since pip/poetry caches are reused there shouldn't be any troubles with install every time
./scripts/pysync
- name: Show versions
run: |
echo Python
python3 --version
poetry run python3 --version
echo Pipenv
poetry --version
echo Pgbench
$PG_BIN/pgbench --version
# FIXME cluster setup is skipped due to various changes in console API
# for now pre created cluster is used. When API gain some stability
# after massive changes dynamic cluster setup will be revived.
# So use pre created cluster. It needs to be started manually, but stop is automatic after 5 minutes of inactivity
- name: Setup cluster
env:
BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
shell: bash
run: |
set -e
echo "Starting cluster"
# wake up the cluster
$PG_BIN/psql $BENCHMARK_CONNSTR -c "SELECT 1"
- name: Run benchmark
# pgbench is installed system wide from official repo
# https://download.postgresql.org/pub/repos/yum/13/redhat/rhel-7-x86_64/
# via
# sudo tee /etc/yum.repos.d/pgdg.repo<<EOF
# [pgdg13]
# name=PostgreSQL 13 for RHEL/CentOS 7 - x86_64
# baseurl=https://download.postgresql.org/pub/repos/yum/13/redhat/rhel-7-x86_64/
# enabled=1
# gpgcheck=0
# EOF
# sudo yum makecache
# sudo yum install postgresql13-contrib
# actual binaries are located in /usr/pgsql-13/bin/
env:
TEST_PG_BENCH_TRANSACTIONS_MATRIX: "5000,10000,20000"
TEST_PG_BENCH_SCALES_MATRIX: "10,15"
PLATFORM: "zenith-staging"
BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
REMOTE_ENV: "1" # indicate to test harness that we do not have zenith binaries locally
run: |
# just to be sure that no data was cached on self hosted runner
# since it might generate duplicates when calling ingest_perf_test_result.py
rm -rf perf-report-staging
mkdir -p perf-report-staging
./scripts/pytest test_runner/performance/ -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-staging
- name: Submit result
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
run: |
REPORT_FROM=$(realpath perf-report-staging) REPORT_TO=staging scripts/generate_and_push_perf_report.sh

View File

@@ -64,10 +64,11 @@ jobs:
target
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
# Use `env CARGO_INCREMENTAL=0` to mitigate https://github.com/rust-lang/rust/issues/91696 for rustc 1.57.0
- name: Run cargo build
run: |
cargo build --workspace --bins --examples --tests
env CARGO_INCREMENTAL=0 cargo build --workspace --bins --examples --tests
- name: Run cargo test
run: |
cargo test -- --nocapture --test-threads=1
env CARGO_INCREMENTAL=0 cargo test -- --nocapture --test-threads=1

4
.gitignore vendored
View File

@@ -7,3 +7,7 @@ test_output/
.vscode
/.zenith
/integration_tests/.zenith
# Coverage
*.profraw
*.profdata

10
.yapfignore Normal file
View File

@@ -0,0 +1,10 @@
# This file is only read when `yapf` is run from this directory.
# Hence we only top-level directories here to avoid confusion.
# See source code for the exact file format: https://github.com/google/yapf/blob/c6077954245bc3add82dafd853a1c7305a6ebd20/yapf/yapflib/file_resources.py#L40-L43
vendor/
target/
tmp_install/
__pycache__/
test_output/
.zenith/
.git/

1604
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,5 +1,6 @@
[workspace]
members = [
"compute_tools",
"control_plane",
"pageserver",
"postgres_ffi",
@@ -15,3 +16,8 @@ members = [
# This is useful for profiling and, to some extent, debug.
# Besides, debug info should not affect the performance.
debug = true
# This is only needed for proxy's tests
# TODO: we should probably fork tokio-postgres-rustls instead
[patch.crates-io]
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }

View File

@@ -10,40 +10,26 @@ FROM zenithdb/build:buster AS pg-build
WORKDIR /zenith
COPY ./vendor/postgres vendor/postgres
COPY ./Makefile Makefile
ENV BUILD_TYPE release
RUN make -j $(getconf _NPROCESSORS_ONLN) -s postgres
#
# Calculate cargo dependencies.
# This will always run, but only generate recipe.json with list of dependencies without
# installing them.
#
FROM zenithdb/build:buster AS cargo-deps-inspect
WORKDIR /zenith
COPY . .
RUN cargo chef prepare --recipe-path /zenith/recipe.json
#
# Build cargo dependencies.
# This temp cantainner should be rebuilt only if recipe.json was changed.
#
FROM zenithdb/build:buster AS deps-build
WORKDIR /zenith
COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY --from=cargo-deps-inspect /usr/local/cargo/bin/cargo-chef /usr/local/cargo/bin/
COPY --from=cargo-deps-inspect /zenith/recipe.json recipe.json
RUN ROCKSDB_LIB_DIR=/usr/lib/ cargo chef cook --release --recipe-path recipe.json
RUN rm -rf postgres_install/build
#
# Build zenith binaries
#
# TODO: build cargo deps as separate layer. We used cargo-chef before but that was
# net time waste in a lot of cases. Copying Cargo.lock with empty lib.rs should do the work.
#
FROM zenithdb/build:buster AS build
ARG GIT_VERSION
RUN if [ -z "$GIT_VERSION" ]; then echo "GIT_VERSION is reqired, use build_arg to pass it"; exit 1; fi
WORKDIR /zenith
COPY . .
# Copy cached dependencies
COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY --from=deps-build /zenith/target target
COPY --from=deps-build /usr/local/cargo/ /usr/local/cargo/
RUN cargo build --release
COPY . .
RUN GIT_VERSION=$GIT_VERSION cargo build --release
#
# Copy binaries to resulting image.
@@ -51,11 +37,11 @@ RUN cargo build --release
FROM debian:buster-slim
WORKDIR /data
RUN apt-get update && apt-get -yq install librocksdb-dev libseccomp-dev openssl && \
RUN apt-get update && apt-get -yq install libreadline-dev libseccomp-dev openssl ca-certificates && \
mkdir zenith_install
COPY --from=build /zenith/target/release/pageserver /usr/local/bin
COPY --from=build /zenith/target/release/wal_acceptor /usr/local/bin
COPY --from=build /zenith/target/release/safekeeper /usr/local/bin
COPY --from=build /zenith/target/release/proxy /usr/local/bin
COPY --from=pg-build /zenith/tmp_install postgres_install
COPY docker-entrypoint.sh /docker-entrypoint.sh

View File

@@ -81,7 +81,7 @@ FROM alpine:3.13
RUN apk add --update openssl build-base libseccomp-dev
RUN apk --no-cache --update --repository https://dl-cdn.alpinelinux.org/alpine/edge/testing add rocksdb
COPY --from=build /zenith/target/release/pageserver /usr/local/bin
COPY --from=build /zenith/target/release/wal_acceptor /usr/local/bin
COPY --from=build /zenith/target/release/safekeeper /usr/local/bin
COPY --from=build /zenith/target/release/proxy /usr/local/bin
COPY --from=pg-build /zenith/tmp_install /usr/local
COPY docker-entrypoint.sh /docker-entrypoint.sh

View File

@@ -2,14 +2,15 @@
# Image with all the required dependencies to build https://github.com/zenithdb/zenith
# and Postgres from https://github.com/zenithdb/postgres
# Also includes some rust development and build tools.
# NB: keep in sync with rust image version in .circle/config.yml
#
FROM rust:slim-buster
FROM rust:1.56.1-slim-buster
WORKDIR /zenith
# Install postgres and zenith build dependencies
# clang is for rocksdb
RUN apt-get update && apt-get -yq install automake libtool build-essential bison flex libreadline-dev zlib1g-dev libxml2-dev \
libseccomp-dev pkg-config libssl-dev librocksdb-dev clang
libseccomp-dev pkg-config libssl-dev clang
# Install rust tools
RUN rustup component add clippy && cargo install cargo-chef cargo-audit
RUN rustup component add clippy && cargo install cargo-audit

14
Dockerfile.compute-tools Normal file
View File

@@ -0,0 +1,14 @@
# First transient image to build compute_tools binaries
# NB: keep in sync with rust image version in .circle/config.yml
FROM rust:1.56.1-slim-buster AS rust-build
WORKDIR /zenith
COPY . .
RUN cargo build -p compute_tools --release
# Final image that only has one binary
FROM debian:buster-slim
COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl

View File

@@ -6,34 +6,55 @@ else
SECCOMP =
endif
#
# We differentiate between release / debug build types using the BUILD_TYPE
# environment variable.
#
BUILD_TYPE ?= debug
ifeq ($(BUILD_TYPE),release)
PG_CONFIGURE_OPTS = --enable-debug
PG_CFLAGS = -O2 -g3 $(CFLAGS)
# Unfortunately, `--profile=...` is a nightly feature
CARGO_BUILD_FLAGS += --release
else ifeq ($(BUILD_TYPE),debug)
PG_CONFIGURE_OPTS = --enable-debug --enable-cassert --enable-depend
PG_CFLAGS = -O0 -g3 $(CFLAGS)
else
$(error Bad build type `$(BUILD_TYPE)', see Makefile for options)
endif
# Choose whether we should be silent or verbose
CARGO_BUILD_FLAGS += --$(if $(filter s,$(MAKEFLAGS)),quiet,verbose)
# Fix for a corner case when make doesn't pass a jobserver
CARGO_BUILD_FLAGS += $(filter -j1,$(MAKEFLAGS))
# This option has a side effect of passing make jobserver to cargo.
# However, we shouldn't do this if `make -n` (--dry-run) has been asked.
CARGO_CMD_PREFIX += $(if $(filter n,$(MAKEFLAGS)),,+)
# Force cargo not to print progress bar
CARGO_CMD_PREFIX += CARGO_TERM_PROGRESS_WHEN=never CI=1
#
# Top level Makefile to build Zenith and PostgreSQL
#
.PHONY: all
all: zenith postgres
# We don't want to run 'cargo build' in parallel with the postgres build,
# because interleaving cargo build output with postgres build output looks
# confusing. Also, 'cargo build' is parallel on its own, so it would be too
# much parallelism. (Recursive invocation of postgres target still gets any
# '-j' flag from the command line, so 'make -j' is still useful.)
.NOTPARALLEL:
### Zenith Rust bits
#
# The 'postgres_ffi' depends on the Postgres headers.
.PHONY: zenith
zenith: postgres-headers
cargo build
+@echo "Compiling Zenith"
$(CARGO_CMD_PREFIX) cargo build $(CARGO_BUILD_FLAGS)
### PostgreSQL parts
tmp_install/build/config.status:
+@echo "Configuring postgres build"
mkdir -p tmp_install/build
(cd tmp_install/build && \
../../vendor/postgres/configure CFLAGS='-O0 -g3 $(CFLAGS)' \
--enable-cassert \
--enable-debug \
--enable-depend \
../../vendor/postgres/configure CFLAGS='$(PG_CFLAGS)' \
$(PG_CONFIGURE_OPTS) \
$(SECCOMP) \
--prefix=$(abspath tmp_install) > configure.log)
@@ -47,10 +68,10 @@ postgres-headers: postgres-configure
+@echo "Installing PostgreSQL headers"
$(MAKE) -C tmp_install/build/src/include MAKELEVEL=0 install
# Compile and install PostgreSQL and contrib/zenith
.PHONY: postgres
postgres: postgres-configure
postgres: postgres-configure \
postgres-headers # to prevent `make install` conflicts with zenith's `postgres-headers`
+@echo "Compiling PostgreSQL"
$(MAKE) -C tmp_install/build MAKELEVEL=0 install
+@echo "Compiling contrib/zenith"
@@ -58,18 +79,21 @@ postgres: postgres-configure
+@echo "Compiling contrib/zenith_test_utils"
$(MAKE) -C tmp_install/build/contrib/zenith_test_utils install
.PHONY: postgres-clean
postgres-clean:
$(MAKE) -C tmp_install/build MAKELEVEL=0 clean
# This doesn't remove the effects of 'configure'.
.PHONY: clean
clean:
cd tmp_install/build && ${MAKE} clean
cargo clean
cd tmp_install/build && $(MAKE) clean
$(CARGO_CMD_PREFIX) cargo clean
# This removes everything
.PHONY: distclean
distclean:
rm -rf tmp_install
cargo clean
$(CARGO_CMD_PREFIX) cargo clean
.PHONY: fmt
fmt:

View File

@@ -1 +0,0 @@
./test_runner/Pipfile

1
Pipfile.lock generated
View File

@@ -1 +0,0 @@
./test_runner/Pipfile.lock

View File

@@ -1,12 +1,12 @@
# Zenith
Zenith substitutes PostgreSQL storage layer and redistributes data across a cluster of nodes
Zenith is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
## Architecture overview
A Zenith installation consists of Compute nodes and Storage engine.
A Zenith installation consists of compute nodes and Zenith storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by zenith storage.
Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage engine.
Zenith storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
@@ -25,15 +25,15 @@ Pageserver consists of:
On Ubuntu or Debian this set of packages should be sufficient to build the code:
```text
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang
libssl-dev clang pkg-config libpq-dev
```
[Rust] 1.52 or later is also required.
[Rust] 1.56.1 or later is also required.
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests (not required to use the code), install
Python (3.6 or higher), and install python3 packages with `pipenv` using `pipenv install` in the project directory.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build zenith and patched postgres
```sh
@@ -47,17 +47,26 @@ make -j5
# Create repository in .zenith with proper paths to binaries and data
# Later that would be responsibility of a package install script
> ./target/debug/zenith init
initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229
created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8
created main branch
pageserver init succeeded
# start pageserver
# start pageserver and safekeeper
> ./target/debug/zenith start
Starting pageserver at '127.0.0.1:64000' in .zenith
Starting pageserver at 'localhost:64000' in '.zenith'
Pageserver started
initializing for single for 7676
Starting safekeeper at 'localhost:5454' in '.zenith/safekeepers/single'
Safekeeper started
# start postgres on top on the pageserver
# start postgres compute node
> ./target/debug/zenith pg start main
Starting postgres node at 'host=127.0.0.1 port=55432 user=stas'
Starting new postgres main on main...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres'
waiting for server to start.... done
server started
# check list of running postgres instances
> ./target/debug/zenith pg list
@@ -108,13 +117,18 @@ postgres=# insert into t values(2,2);
INSERT 0 1
```
6. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances
you have just started. You can stop them all with one command:
```sh
> ./target/debug/zenith stop
```
## Running tests
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
make # builds also postgres and installs it to ./tmp_install
cd test_runner
pytest
./scripts/pytest
```
## Documentation

View File

@@ -0,0 +1 @@
target

1
compute_tools/.gitignore vendored Normal file
View File

@@ -0,0 +1 @@
target

19
compute_tools/Cargo.toml Normal file
View File

@@ -0,0 +1,19 @@
[package]
name = "compute_tools"
version = "0.1.0"
edition = "2021"
[dependencies]
libc = "0.2"
anyhow = "1.0"
chrono = "0.4"
clap = "3.0"
env_logger = "0.9"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1", features = ["macros", "rt", "rt-multi-thread"] }

81
compute_tools/README.md Normal file
View File

@@ -0,0 +1,81 @@
# Compute node tools
Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
`ExecStart` option. It will handle all the `zenith` specifics during compute node
initialization:
- `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
- Every start is a fresh start, so the data directory is removed and
initialized again on each run.
- Next it will put configuration files into the `PGDATA` directory.
- Sync safekeepers and get commit LSN.
- Get `basebackup` from pageserver using the returned on the previous step LSN.
- Try to start `postgres` and wait until it is ready to accept connections.
- Check and alter/drop/create roles and databases.
- Hang waiting on the `postmaster` process to exit.
Also `zenith_ctl` spawns two separate service threads:
- `compute-monitor` checks the last Postgres activity timestamp and saves it
into the shared `ComputeState`;
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
Usage example:
```sh
zenith_ctl -D /var/db/postgres/compute \
-C 'postgresql://zenith_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
```
## Tests
Cargo formatter:
```sh
cargo fmt
```
Run tests:
```sh
cargo test
```
Clippy linter:
```sh
cargo clippy --all --all-targets -- -Dwarnings -Drust-2018-idioms
```
## Cross-platform compilation
Imaging that you are on macOS (x86) and you want a Linux GNU (`x86_64-unknown-linux-gnu` platform in `rust` terminology) executable.
### Using docker
You can use a throw-away Docker container ([rustlang/rust](https://hub.docker.com/r/rustlang/rust/) image) for doing that:
```sh
docker run --rm \
-v $(pwd):/compute_tools \
-w /compute_tools \
-t rustlang/rust:nightly cargo build --release --target=x86_64-unknown-linux-gnu
```
or one-line:
```sh
docker run --rm -v $(pwd):/compute_tools -w /compute_tools -t rust:latest cargo build --release --target=x86_64-unknown-linux-gnu
```
### Using rust native cross-compilation
Another way is to add `x86_64-unknown-linux-gnu` target on your host system:
```sh
rustup target add x86_64-unknown-linux-gnu
```
Install macOS cross-compiler toolchain:
```sh
brew tap SergioBenitez/osxct
brew install x86_64-unknown-linux-gnu
```
And finally run `cargo build`:
```sh
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-unknown-linux-gnu-gcc cargo build --target=x86_64-unknown-linux-gnu --release
```

View File

@@ -0,0 +1 @@
max_width = 100

View File

@@ -0,0 +1,249 @@
//!
//! Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
//! `ExecStart` option. It will handle all the `zenith` specifics during compute node
//! initialization:
//! - `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
//! - Try to start `postgres` and wait until it is ready to accept connections.
//! - Check and alter/drop/create roles and databases.
//! - Hang waiting on the `postmaster` process to exit.
//!
//! Also `zenith_ctl` spawns two separate service threads:
//! - `compute-monitor` checks the last Postgres activity timestamp and saves it
//! into the shared `ComputeState`;
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! Usage example:
//! ```sh
//! zenith_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://zenith_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! ```
//!
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::{exit, Command, ExitStatus};
use std::sync::{Arc, RwLock};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use log::info;
use postgres::{Client, NoTls};
use compute_tools::config;
use compute_tools::http_api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::pg_helpers::*;
use compute_tools::spec::*;
use compute_tools::zenith::*;
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
fn prepare_pgdata(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let state = state.read().unwrap();
let spec = &state.spec;
let pgdata_path = Path::new(&state.pgdata);
let pageserver_connstr = spec
.cluster
.settings
.find("zenith.page_server_connstring")
.expect("pageserver connstr should be provided");
let tenant = spec
.cluster
.settings
.find("zenith.zenith_tenant")
.expect("tenant id should be provided");
let timeline = spec
.cluster
.settings
.find("zenith.zenith_timeline")
.expect("tenant id should be provided");
info!(
"starting cluster #{}, operation #{}",
spec.cluster.cluster_id,
spec.operation_uuid.as_ref().unwrap()
);
// Remove/create an empty pgdata directory and put configuration there.
create_pgdata(&state.pgdata)?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
info!("starting safekeepers syncing");
let lsn = sync_safekeepers(&state.pgdata, &state.pgbin)
.with_context(|| "failed to sync safekeepers")?;
info!("safekeepers synced at LSN {}", lsn);
info!(
"getting basebackup@{} from pageserver {}",
lsn, pageserver_connstr
);
get_basebackup(&state.pgdata, &pageserver_connstr, &tenant, &timeline, &lsn).with_context(
|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, pageserver_connstr
)
},
)?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
let read_state = state.read().unwrap();
let pgdata_path = Path::new(&read_state.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&read_state.pgbin)
.args(&["-D", &read_state.pgdata])
.spawn()
.expect("cannot start postgres process");
// Try default Postgres port if it is not provided
let port = read_state
.spec
.cluster
.settings
.find("port")
.unwrap_or_else(|| "5432".to_string());
wait_for_postgres(&port, pgdata_path)?;
let mut client = Client::connect(&read_state.connstr, NoTls)?;
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
// 'Close' connection
drop(client);
info!(
"finished configuration of cluster #{}",
read_state.spec.cluster.cluster_id
);
// Release the read lock.
drop(read_state);
// Get the write lock, update state and release the lock, so HTTP API
// was able to serve requests, while we are blocked waiting on
// Postgres.
let mut state = state.write().unwrap();
state.ready = true;
drop(state);
// Wait for child postgres process basically forever. In this state Ctrl+C
// will be propagated to postgres and it will be shut down as well.
let ecode = pg.wait().expect("failed to wait on postgres");
Ok(ecode)
}
fn main() -> Result<()> {
// TODO: re-use `zenith_utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`
let version: Option<&str> = option_env!("CARGO_PKG_VERSION");
let matches = clap::App::new("zenith_ctl")
.version(version.unwrap_or("unknown"))
.arg(
Arg::new("connstr")
.short('C')
.long("connstr")
.value_name("DATABASE_URL")
.required(true),
)
.arg(
Arg::new("pgdata")
.short('D')
.long("pgdata")
.value_name("DATADIR")
.required(true),
)
.arg(
Arg::new("pgbin")
.short('b')
.long("pgbin")
.value_name("POSTGRES_PATH"),
)
.arg(
Arg::new("spec")
.short('s')
.long("spec")
.value_name("SPEC_JSON"),
)
.arg(
Arg::new("spec-path")
.short('S')
.long("spec-path")
.value_name("SPEC_PATH"),
)
.get_matches();
let pgdata = matches.value_of("pgdata").expect("PGDATA path is required");
let connstr = matches
.value_of("connstr")
.expect("Postgres connection string is required");
let spec = matches.value_of("spec");
let spec_path = matches.value_of("spec-path");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.value_of("pgbin").unwrap_or("postgres");
let spec: ClusterSpec = match spec {
// First, try to get cluster spec from the cli argument
Some(json) => serde_json::from_str(json)?,
None => {
// Second, try to read it from the file if path is provided
if let Some(sp) = spec_path {
let path = Path::new(sp);
let file = File::open(path)?;
serde_json::from_reader(file)?
} else {
panic!("cluster spec should be provided via --spec or --spec-path argument");
}
}
};
let compute_state = ComputeState {
connstr: connstr.to_string(),
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
spec,
ready: false,
last_active: Utc::now(),
};
let compute_state = Arc::new(RwLock::new(compute_state));
// Launch service threads first, so we were able to serve availability
// requests, while configuration is still in progress.
let mut _threads = vec![
launch_http_server(&compute_state).expect("cannot launch compute monitor thread"),
launch_monitor(&compute_state).expect("cannot launch http endpoint thread"),
];
prepare_pgdata(&compute_state)?;
// Run compute (Postgres) and hang waiting on it. Panic if any error happens,
// it will help us to trigger unwind and kill postmaster as well.
match run_compute(&compute_state) {
Ok(ec) => exit(ec.success() as i32),
Err(error) => panic!("cannot start compute node, error: {}", error),
}
}

View File

@@ -0,0 +1,51 @@
use std::fs::{File, OpenOptions};
use std::io;
use std::io::prelude::*;
use std::path::Path;
use anyhow::Result;
use crate::pg_helpers::PgOptionsSerialize;
use crate::zenith::ClusterSpec;
/// Check that `line` is inside a text file and put it there if it is not.
/// Create file if it doesn't exist.
pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
let mut file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.append(false)
.open(path)?;
let buf = io::BufReader::new(&file);
let mut count: usize = 0;
for l in buf.lines() {
if l? == line {
return Ok(false);
}
count = 1;
}
write!(file, "{}{}", "\n".repeat(count), line)?;
Ok(true)
}
/// Create or completely rewrite configuration file specified by `path`
pub fn write_postgres_conf(path: &Path, spec: &ClusterSpec) -> Result<()> {
// File::create() destroys the file content if it exists.
let mut postgres_conf = File::create(path)?;
write_zenith_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?;
Ok(())
}
// Write Postgres config block wrapped with generated comment section
fn write_zenith_managed_block(file: &mut File, buf: &str) -> Result<()> {
writeln!(file, "# Managed by Zenith: begin")?;
writeln!(file, "{}", buf)?;
writeln!(file, "# Managed by Zenith: end")?;
Ok(())
}

View File

@@ -0,0 +1,73 @@
use std::convert::Infallible;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};
use std::thread;
use anyhow::Result;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use log::{error, info};
use crate::zenith::*;
// Service function to handle all available routes.
fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
match (req.method(), req.uri().path()) {
// Timestamp of the last Postgres activity in the plain text.
(&Method::GET, "/last_activity") => {
info!("serving /last_active GET request");
let state = state.read().unwrap();
// Use RFC3339 format for consistency.
Response::new(Body::from(state.last_active.to_rfc3339()))
}
// Has compute setup process finished? -> true/false
(&Method::GET, "/ready") => {
info!("serving /ready GET request");
let state = state.read().unwrap();
Response::new(Body::from(format!("{}", state.ready)))
}
// Return the `404 Not Found` for any other routes.
_ => {
let mut not_found = Response::new(Body::from("404 Not Found"));
*not_found.status_mut() = StatusCode::NOT_FOUND;
not_found
}
}
}
// Main Hyper HTTP server function that runs it and blocks waiting on it forever.
#[tokio::main]
async fn serve(state: Arc<RwLock<ComputeState>>) {
let addr = SocketAddr::from(([0, 0, 0, 0], 3080));
let make_service = make_service_fn(move |_conn| {
let state = state.clone();
async move {
Ok::<_, Infallible>(service_fn(move |req: Request<Body>| {
let state = state.clone();
async move { Ok::<_, Infallible>(routes(req, state)) }
}))
}
});
info!("starting HTTP server on {}", addr);
let server = Server::bind(&addr).serve(make_service);
// Run this server forever
if let Err(e) = server.await {
error!("server error: {}", e);
}
}
/// Launch a separate Hyper HTTP API server thread and return its `JoinHandle`.
pub fn launch_http_server(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()
.name("http-endpoint".into())
.spawn(move || serve(state))?)
}

13
compute_tools/src/lib.rs Normal file
View File

@@ -0,0 +1,13 @@
//!
//! Various tools and helpers to handle cluster / compute node (Postgres)
//! configuration.
//!
pub mod config;
pub mod http_api;
#[macro_use]
pub mod logger;
pub mod monitor;
pub mod params;
pub mod pg_helpers;
pub mod spec;
pub mod zenith;

View File

@@ -0,0 +1,43 @@
use std::io::Write;
use anyhow::Result;
use chrono::Utc;
use env_logger::{Builder, Env};
macro_rules! info_println {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
println!($($tts)*);
}
}
}
macro_rules! info_print {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
print!($($tts)*);
}
}
}
/// Initialize `env_logger` using either `default_level` or
/// `RUST_LOG` environment variable as default log level.
pub fn init_logger(default_level: &str) -> Result<()> {
let env = Env::default().filter_or("RUST_LOG", default_level);
Builder::from_env(env)
.format(|buf, record| {
let thread_handle = std::thread::current();
writeln!(
buf,
"{} [{}] {}: {}",
Utc::now().format("%Y-%m-%d %H:%M:%S%.3f %Z"),
thread_handle.name().unwrap_or("main"),
record.level(),
record.args()
)
})
.init();
Ok(())
}

View File

@@ -0,0 +1,109 @@
use std::sync::{Arc, RwLock};
use std::{thread, time};
use anyhow::Result;
use chrono::{DateTime, Utc};
use log::{debug, info};
use postgres::{Client, NoTls};
use crate::zenith::ComputeState;
const MONITOR_CHECK_INTERVAL: u64 = 500; // milliseconds
// Spin in a loop and figure out the last activity time in the Postgres.
// Then update it in the shared state. This function never errors out.
// XXX: the only expected panic is at `RwLock` unwrap().
fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
// Suppose that `connstr` doesn't change
let connstr = state.read().unwrap().connstr.clone();
// Define `client` outside of the loop to reuse existing connection if it's active.
let mut client = Client::connect(&connstr, NoTls);
let timeout = time::Duration::from_millis(MONITOR_CHECK_INTERVAL);
info!("watching Postgres activity at {}", connstr);
loop {
// Should be outside of the write lock to allow others to read while we sleep.
thread::sleep(timeout);
match &mut client {
Ok(cli) => {
if cli.is_closed() {
info!("connection to postgres closed, trying to reconnect");
// Connection is closed, reconnect and try again.
client = Client::connect(&connstr, NoTls);
continue;
}
// Get all running client backends except ourself, use RFC3339 DateTime format.
let backends = cli
.query(
"SELECT state, to_char(state_change, 'YYYY-MM-DD\"T\"HH24:MI:SS.US\"Z\"') AS state_change
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND pid != pg_backend_pid()
AND usename != 'zenith_admin';", // XXX: find a better way to filter other monitors?
&[],
);
let mut last_active = state.read().unwrap().last_active;
if let Ok(backs) = backends {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
for b in backs.into_iter() {
let state: String = b.get("state");
let change: String = b.get("state_change");
if state == "idle" {
let change = DateTime::parse_from_rfc3339(&change);
match change {
Ok(t) => idle_backs.push(t.with_timezone(&Utc)),
Err(e) => {
info!("cannot parse backend state_change DateTime: {}", e);
continue;
}
}
} else {
// Found non-idle backend, so the last activity is NOW.
// Save it and exit the for loop. Also clear the idle backend
// `state_change` timestamps array as it doesn't matter now.
last_active = Utc::now();
idle_backs.clear();
break;
}
}
// Sort idle backend `state_change` timestamps. The last one corresponds
// to the last activity.
idle_backs.sort();
if let Some(last) = idle_backs.last() {
last_active = *last;
}
}
// Update the last activity in the shared state if we got a more recent one.
let mut state = state.write().unwrap();
if last_active > state.last_active {
state.last_active = last_active;
debug!("set the last compute activity time to: {}", last_active);
}
}
Err(e) => {
info!("cannot connect to postgres: {}, retrying", e);
// Establish a new connection and try again.
client = Client::connect(&connstr, NoTls);
}
}
}
}
/// Launch a separate compute monitor thread and return its `JoinHandle`.
pub fn launch_monitor(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()
.name("compute-monitor".into())
.spawn(move || watch_compute_activity(&state))?)
}

View File

@@ -0,0 +1,3 @@
pub const DEFAULT_LOG_LEVEL: &str = "info";
pub const DEFAULT_CONNSTRING: &str = "host=localhost user=postgres";
pub const PG_HBA_ALL_MD5: &str = "host\tall\t\tall\t\t0.0.0.0/0\t\tmd5";

View File

@@ -0,0 +1,264 @@
use std::net::{SocketAddr, TcpStream};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Command;
use std::str::FromStr;
use std::{fs, thread, time};
use anyhow::{bail, Result};
use postgres::{Client, Transaction};
use serde::Deserialize;
const POSTGRES_WAIT_TIMEOUT: u64 = 60 * 1000; // milliseconds
/// Rust representation of Postgres role info with only those fields
/// that matter for us.
#[derive(Clone, Deserialize)]
pub struct Role {
pub name: PgIdent,
pub encrypted_password: Option<String>,
pub options: GenericOptions,
}
/// Rust representation of Postgres database info with only those fields
/// that matter for us.
#[derive(Clone, Deserialize)]
pub struct Database {
pub name: PgIdent,
pub owner: PgIdent,
pub options: GenericOptions,
}
/// Common type representing both SQL statement params with or without value,
/// like `LOGIN` or `OWNER username` in the `CREATE/ALTER ROLE`, and config
/// options like `wal_level = logical`.
#[derive(Clone, Deserialize)]
pub struct GenericOption {
pub name: String,
pub value: Option<String>,
pub vartype: String,
}
/// Optional collection of `GenericOption`'s. Type alias allows us to
/// declare a `trait` on it.
pub type GenericOptions = Option<Vec<GenericOption>>;
impl GenericOption {
/// Represent `GenericOption` as SQL statement parameter.
pub fn to_pg_option(&self) -> String {
if let Some(val) = &self.value {
match self.vartype.as_ref() {
"string" => format!("{} '{}'", self.name, val),
_ => format!("{} {}", self.name, val),
}
} else {
self.name.to_owned()
}
}
/// Represent `GenericOption` as configuration option.
pub fn to_pg_setting(&self) -> String {
if let Some(val) = &self.value {
match self.vartype.as_ref() {
"string" => format!("{} = '{}'", self.name, val),
_ => format!("{} = {}", self.name, val),
}
} else {
self.name.to_owned()
}
}
}
pub trait PgOptionsSerialize {
fn as_pg_options(&self) -> String;
fn as_pg_settings(&self) -> String;
}
impl PgOptionsSerialize for GenericOptions {
/// Serialize an optional collection of `GenericOption`'s to
/// Postgres SQL statement arguments.
fn as_pg_options(&self) -> String {
if let Some(ops) = &self {
ops.iter()
.map(|op| op.to_pg_option())
.collect::<Vec<String>>()
.join(" ")
} else {
"".to_string()
}
}
/// Serialize an optional collection of `GenericOption`'s to
/// `postgresql.conf` compatible format.
fn as_pg_settings(&self) -> String {
if let Some(ops) = &self {
ops.iter()
.map(|op| op.to_pg_setting())
.collect::<Vec<String>>()
.join("\n")
} else {
"".to_string()
}
}
}
pub trait GenericOptionsSearch {
fn find(&self, name: &str) -> Option<String>;
}
impl GenericOptionsSearch for GenericOptions {
/// Lookup option by name
fn find(&self, name: &str) -> Option<String> {
match &self {
Some(ops) => {
let op = ops.iter().find(|s| s.name == name);
match op {
Some(op) => op.value.clone(),
None => None,
}
}
None => None,
}
}
}
impl Role {
/// Serialize a list of role parameters into a Postgres-acceptable
/// string of arguments.
pub fn to_pg_options(&self) -> String {
// XXX: consider putting LOGIN as a default option somewhere higher, e.g. in Rails.
// For now we do not use generic `options` for roles. Once used, add
// `self.options.as_pg_options()` somewhere here.
let mut params: String = "LOGIN".to_string();
if let Some(pass) = &self.encrypted_password {
params.push_str(&format!(" PASSWORD 'md5{}'", pass));
} else {
params.push_str(" PASSWORD NULL");
}
params
}
}
impl Database {
/// Serialize a list of database parameters into a Postgres-acceptable
/// string of arguments.
/// NB: `TEMPLATE` is actually also an identifier, but so far we only need
/// to use `template0` and `template1`, so it is not a problem. Yet in the future
/// it may require a proper quoting too.
pub fn to_pg_options(&self) -> String {
let mut params: String = self.options.as_pg_options();
params.push_str(&format!(" OWNER {}", &self.owner.quote()));
params
}
}
/// String type alias representing Postgres identifier and
/// intended to be used for DB / role names.
pub type PgIdent = String;
/// Generic trait used to provide quoting for strings used in the
/// Postgres SQL queries. Currently used only to implement quoting
/// of identifiers, but could be used for literals in the future.
pub trait PgQuote {
fn quote(&self) -> String;
}
impl PgQuote for PgIdent {
/// This is intended to mimic Postgres quote_ident(), but for simplicity it
/// always quotes provided string with `""` and escapes every `"`. Not idempotent,
/// i.e. if string is already escaped it will be escaped again.
fn quote(&self) -> String {
let result = format!("\"{}\"", self.replace("\"", "\"\""));
result
}
}
/// Build a list of existing Postgres roles
pub fn get_existing_roles(xact: &mut Transaction<'_>) -> Result<Vec<Role>> {
let postgres_roles = xact
.query("SELECT rolname, rolpassword FROM pg_catalog.pg_authid", &[])?
.iter()
.map(|row| Role {
name: row.get("rolname"),
encrypted_password: row.get("rolpassword"),
options: None,
})
.collect();
Ok(postgres_roles)
}
/// Build a list of existing Postgres databases
pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
let postgres_dbs = client
.query(
"SELECT datname, datdba::regrole::text as owner
FROM pg_catalog.pg_database;",
&[],
)?
.iter()
.map(|row| Database {
name: row.get("datname"),
owner: row.get("owner"),
options: None,
})
.collect();
Ok(postgres_dbs)
}
/// Wait for Postgres to become ready to accept connections:
/// - state should be `ready` in the `pgdata/postmaster.pid`
/// - and we should be able to connect to 127.0.0.1:5432
pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> {
let pid_path = pgdata.join("postmaster.pid");
let mut slept: u64 = 0; // ms
let pause = time::Duration::from_millis(100);
let timeout = time::Duration::from_millis(200);
let addr = SocketAddr::from_str(&format!("127.0.0.1:{}", port)).unwrap();
loop {
// Sleep POSTGRES_WAIT_TIMEOUT at max (a bit longer actually if consider a TCP timeout,
// but postgres starts listening almost immediately, even if it is not really
// ready to accept connections).
if slept >= POSTGRES_WAIT_TIMEOUT {
bail!("timed out while waiting for Postgres to start");
}
if pid_path.exists() {
// XXX: dumb and the simplest way to get the last line in a text file
// TODO: better use `.lines().last()` later
let stdout = Command::new("tail")
.args(&["-n1", pid_path.to_str().unwrap()])
.output()?
.stdout;
let status = String::from_utf8(stdout)?;
let can_connect = TcpStream::connect_timeout(&addr, timeout).is_ok();
// Now Postgres is ready to accept connections
if status.trim() == "ready" && can_connect {
break;
}
}
thread::sleep(pause);
slept += 100;
}
Ok(())
}
/// Remove `pgdata` directory and create it again with right permissions.
pub fn create_pgdata(pgdata: &str) -> Result<()> {
// Ignore removal error, likely it is a 'No such file or directory (os error 2)'.
// If it is something different then create_dir() will error out anyway.
let _ok = fs::remove_dir_all(pgdata);
fs::create_dir(pgdata)?;
fs::set_permissions(pgdata, fs::Permissions::from_mode(0o700))?;
Ok(())
}

246
compute_tools/src/spec.rs Normal file
View File

@@ -0,0 +1,246 @@
use std::path::Path;
use anyhow::Result;
use log::{info, log_enabled, warn, Level};
use postgres::Client;
use crate::config;
use crate::params::PG_HBA_ALL_MD5;
use crate::pg_helpers::*;
use crate::zenith::ClusterSpec;
/// It takes cluster specification and does the following:
/// - Serialize cluster config and put it into `postgresql.conf` completely rewriting the file.
/// - Update `pg_hba.conf` to allow external connections.
pub fn handle_configuration(spec: &ClusterSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included into `basebackup`, so just
// always write all config into it creating new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Check `pg_hba.conf` and update if needed to allow external connections.
pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
// XXX: consider making it a part of spec.json
info!("checking pg_hba.conf");
let pghba_path = pgdata_path.join("pg_hba.conf");
if config::line_in_file(&pghba_path, PG_HBA_ALL_MD5)? {
info!("updated pg_hba.conf to allow external connections");
} else {
info!("pg_hba.conf is up-to-date");
}
Ok(())
}
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
// Print a list of existing Postgres roles (only in debug mode)
info!("postgres roles:");
for r in &existing_roles {
info_println!(
"{} - {}:{}",
" ".repeat(27 + 5),
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
}
// Process delta operations first
if let Some(ops) = &spec.delta_operations {
info!("processing delta operations on roles");
for op in ops {
match op.action.as_ref() {
// We do not check either role exists or not,
// Postgres will take care of it for us
"delete_role" => {
let query: String = format!("DROP ROLE IF EXISTS {}", &op.name.quote());
warn!("deleting role '{}'", &op.name);
xact.execute(query.as_str(), &[])?;
}
// Renaming role drops its password, since tole name is
// used as a salt there. It is important that this role
// is recorded with a new `name` in the `roles` list.
// Follow up roles update will set the new password.
"rename_role" => {
let new_name = op.new_name.as_ref().unwrap();
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
if existing_roles.iter().any(|r| r.name == op.name) {
let query: String = format!(
"ALTER ROLE {} RENAME TO {}",
op.name.quote(),
new_name.quote()
);
warn!("renaming role '{}' to '{}'", op.name, new_name);
xact.execute(query.as_str(), &[])?;
}
}
_ => {}
}
}
}
// Refresh Postgres roles info to handle possible roles renaming
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
info!("cluster spec roles:");
for role in &spec.cluster.roles {
let name = &role.name;
info_print!(
"{} - {}:{}",
" ".repeat(27 + 5),
name,
if role.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
let pg_role = existing_roles.iter().find(|r| r.name == *name);
if let Some(r) = pg_role {
let mut update_role = false;
if (r.encrypted_password.is_none() && role.encrypted_password.is_some())
|| (r.encrypted_password.is_some() && role.encrypted_password.is_none())
{
update_role = true;
} else if let Some(pg_pwd) = &r.encrypted_password {
// Check whether password changed or not (trim 'md5:' prefix first)
update_role = pg_pwd[3..] != *role.encrypted_password.as_ref().unwrap();
}
if update_role {
let mut query: String = format!("ALTER ROLE {} ", name.quote());
info_print!(" -> update");
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
} else {
info!("role name {}", &name);
let mut query: String = format!("CREATE ROLE {} ", name.quote());
info!("role create query {}", &query);
info_print!(" -> create");
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
info_print!("\n");
}
xact.commit()?;
Ok(())
}
/// It follows mostly the same logic as `handle_roles()` excepting that we
/// does not use an explicit transactions block, since major database operations
/// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level
/// atomicity should be enough here due to the order of operations and various checks,
/// which together provide us idempotency.
pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
info!("postgres databases:");
for r in &existing_dbs {
info_println!("{} - {}:{}", " ".repeat(27 + 5), r.name, r.owner);
}
// Process delta operations first
if let Some(ops) = &spec.delta_operations {
info!("processing delta operations on databases");
for op in ops {
match op.action.as_ref() {
// We do not check either DB exists or not,
// Postgres will take care of it for us
"delete_db" => {
let query: String = format!("DROP DATABASE IF EXISTS {}", &op.name.quote());
warn!("deleting database '{}'", &op.name);
client.execute(query.as_str(), &[])?;
}
"rename_db" => {
let new_name = op.new_name.as_ref().unwrap();
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
if existing_dbs.iter().any(|r| r.name == op.name) {
let query: String = format!(
"ALTER DATABASE {} RENAME TO {}",
op.name.quote(),
new_name.quote()
);
warn!("renaming database '{}' to '{}'", op.name, new_name);
client.execute(query.as_str(), &[])?;
}
}
_ => {}
}
}
}
// Refresh Postgres databases info to handle possible renames
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
info!("cluster spec databases:");
for db in &spec.cluster.databases {
let name = &db.name;
info_print!("{} - {}:{}", " ".repeat(27 + 5), db.name, db.owner);
// XXX: with a limited number of databases it is fine, but consider making it a HashMap
let pg_db = existing_dbs.iter().find(|r| r.name == *name);
if let Some(r) = pg_db {
// XXX: db owner name is returned as quoted string from Postgres,
// when quoting is needed.
let new_owner = if r.owner.starts_with('\"') {
db.owner.quote()
} else {
db.owner.clone()
};
if new_owner != r.owner {
let query: String = format!(
"ALTER DATABASE {} OWNER TO {}",
name.quote(),
db.owner.quote()
);
info_print!(" -> update");
client.execute(query.as_str(), &[])?;
}
} else {
let mut query: String = format!("CREATE DATABASE {} ", name.quote());
info_print!(" -> create");
query.push_str(&db.to_pg_options());
client.execute(query.as_str(), &[])?;
}
info_print!("\n");
}
Ok(())
}

109
compute_tools/src/zenith.rs Normal file
View File

@@ -0,0 +1,109 @@
use std::process::{Command, Stdio};
use anyhow::Result;
use chrono::{DateTime, Utc};
use postgres::{Client, NoTls};
use serde::Deserialize;
use crate::pg_helpers::*;
/// Compute node state shared across several `zenith_ctl` threads.
/// Should be used under `RwLock` to allow HTTP API server to serve
/// status requests, while configuration is in progress.
pub struct ComputeState {
pub connstr: String,
pub pgdata: String,
pub pgbin: String,
pub spec: ClusterSpec,
/// Compute setup process has finished
pub ready: bool,
/// Timestamp of the last Postgres activity
pub last_active: DateTime<Utc>,
}
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[derive(Clone, Deserialize)]
pub struct ClusterSpec {
pub format_version: f32,
pub timestamp: String,
pub operation_uuid: Option<String>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
}
/// Cluster state seen from the perspective of the external tools
/// like Rails web console.
#[derive(Clone, Deserialize)]
pub struct Cluster {
pub cluster_id: String,
pub name: String,
pub state: Option<String>,
pub roles: Vec<Role>,
pub databases: Vec<Database>,
pub settings: GenericOptions,
}
/// Single cluster state changing operation that could not be represented as
/// a static `Cluster` structure. For example:
/// - DROP DATABASE
/// - DROP ROLE
/// - ALTER ROLE name RENAME TO new_name
/// - ALTER DATABASE name RENAME TO new_name
#[derive(Clone, Deserialize)]
pub struct DeltaOp {
pub action: String,
pub name: PgIdent,
pub new_name: Option<PgIdent>,
}
/// Get basebackup from the libpq connection to pageserver using `connstr` and
/// unarchive it to `pgdata` directory overriding all its previous content.
pub fn get_basebackup(
pgdata: &str,
connstr: &str,
tenant: &str,
timeline: &str,
lsn: &str,
) -> Result<()> {
let mut client = Client::connect(connstr, NoTls)?;
let basebackup_cmd = match lsn {
"0/0" => format!("basebackup {} {}", tenant, timeline), // First start of the compute
_ => format!("basebackup {} {} {}", tenant, timeline, lsn),
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut ar = tar::Archive::new(copyreader);
ar.unpack(&pgdata)?;
Ok(())
}
/// Run `postgres` in a special mode with `--sync-safekeepers` argument
/// and return the reported LSN back to the caller.
pub fn sync_safekeepers(pgdata: &str, pgbin: &str) -> Result<String> {
let sync_handle = Command::new(&pgbin)
.args(&["--sync-safekeepers"])
.env("PGDATA", &pgdata) // we cannot use -D in this mode
.stdout(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"postgres --sync-safekeepers exited with non-zero status: {}",
sync_output.status,
);
}
let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim());
Ok(lsn)
}

View File

@@ -0,0 +1,205 @@
{
"format_version": 1.0,
"timestamp": "2021-05-23T18:25:43.511Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8b",
"cluster": {
"cluster_id": "test-cluster-42",
"name": "Zenith Test",
"state": "restarted",
"roles": [
{
"name": "postgres",
"encrypted_password": "6b1d16b78004bbd51fa06af9eda75972",
"options": null
},
{
"name": "alexk",
"encrypted_password": null,
"options": null
},
{
"name": "zenith \"new\"",
"encrypted_password": "5b1d16b78004bbd51fa06af9eda75972",
"options": null
},
{
"name": "zen",
"encrypted_password": "9b1d16b78004bbd51fa06af9eda75972"
},
{
"name": "\"name\";\\n select 1;",
"encrypted_password": "5b1d16b78004bbd51fa06af9eda75972"
},
{
"name": "MyRole",
"encrypted_password": "5b1d16b78004bbd51fa06af9eda75972"
}
],
"databases": [
{
"name": "DB2",
"owner": "alexk",
"options": [
{
"name": "LC_COLLATE",
"value": "C",
"vartype": "string"
},
{
"name": "LC_CTYPE",
"value": "C",
"vartype": "string"
},
{
"name": "TEMPLATE",
"value": "template0",
"vartype": "enum"
}
]
},
{
"name": "zenith",
"owner": "MyRole"
},
{
"name": "zen",
"owner": "zen"
}
],
"settings": [
{
"name": "fsync",
"value": "off",
"vartype": "bool"
},
{
"name": "wal_level",
"value": "replica",
"vartype": "enum"
},
{
"name": "hot_standby",
"value": "on",
"vartype": "bool"
},
{
"name": "wal_acceptors",
"value": "127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501",
"vartype": "string"
},
{
"name": "wal_log_hints",
"value": "on",
"vartype": "bool"
},
{
"name": "log_connections",
"value": "on",
"vartype": "bool"
},
{
"name": "shared_buffers",
"value": "32768",
"vartype": "integer"
},
{
"name": "port",
"value": "55432",
"vartype": "integer"
},
{
"name": "max_connections",
"value": "100",
"vartype": "integer"
},
{
"name": "max_wal_senders",
"value": "10",
"vartype": "integer"
},
{
"name": "listen_addresses",
"value": "0.0.0.0",
"vartype": "string"
},
{
"name": "wal_sender_timeout",
"value": "0",
"vartype": "integer"
},
{
"name": "password_encryption",
"value": "md5",
"vartype": "enum"
},
{
"name": "maintenance_work_mem",
"value": "65536",
"vartype": "integer"
},
{
"name": "max_parallel_workers",
"value": "8",
"vartype": "integer"
},
{
"name": "max_worker_processes",
"value": "8",
"vartype": "integer"
},
{
"name": "zenith.zenith_tenant",
"value": "b0554b632bd4d547a63b86c3630317e8",
"vartype": "string"
},
{
"name": "max_replication_slots",
"value": "10",
"vartype": "integer"
},
{
"name": "zenith.zenith_timeline",
"value": "2414a61ffc94e428f14b5758fe308e13",
"vartype": "string"
},
{
"name": "shared_preload_libraries",
"value": "zenith",
"vartype": "string"
},
{
"name": "synchronous_standby_names",
"value": "walproposer",
"vartype": "string"
},
{
"name": "zenith.page_server_connstring",
"value": "host=127.0.0.1 port=6400",
"vartype": "string"
}
]
},
"delta_operations": [
{
"action": "delete_db",
"name": "zenith_test"
},
{
"action": "rename_db",
"name": "DB",
"new_name": "DB2"
},
{
"action": "delete_role",
"name": "zenith2"
},
{
"action": "rename_role",
"name": "zenith new",
"new_name": "zenith \"new\""
}
]
}

View File

@@ -0,0 +1,48 @@
#[cfg(test)]
mod config_tests {
use std::fs::{remove_file, File};
use std::io::{Read, Write};
use std::path::Path;
use compute_tools::config::*;
fn write_test_file(path: &Path, content: &str) {
let mut file = File::create(path).unwrap();
file.write_all(content.as_bytes()).unwrap();
}
fn check_file_content(path: &Path, expected_content: &str) {
let mut file = File::open(path).unwrap();
let mut content = String::new();
file.read_to_string(&mut content).unwrap();
assert_eq!(content, expected_content);
}
#[test]
fn test_line_in_file() {
let path = Path::new("./tests/tmp/config_test.txt");
write_test_file(path, "line1\nline2.1\t line2.2\nline3");
let line = "line2.1\t line2.2";
let result = line_in_file(path, line).unwrap();
assert!(!result);
check_file_content(path, "line1\nline2.1\t line2.2\nline3");
let line = "line4";
let result = line_in_file(path, line).unwrap();
assert!(result);
check_file_content(path, "line1\nline2.1\t line2.2\nline3\nline4");
remove_file(path).unwrap();
let path = Path::new("./tests/tmp/new_config_test.txt");
let line = "line4";
let result = line_in_file(path, line).unwrap();
assert!(result);
check_file_content(path, "line4");
remove_file(path).unwrap();
}
}

View File

@@ -0,0 +1,41 @@
#[cfg(test)]
mod pg_helpers_tests {
use std::fs::File;
use compute_tools::pg_helpers::*;
use compute_tools::zenith::ClusterSpec;
#[test]
fn params_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.databases.first().unwrap().to_pg_options(),
"LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0 OWNER \"alexk\""
);
assert_eq!(
spec.cluster.roles.first().unwrap().to_pg_options(),
"LOGIN PASSWORD 'md56b1d16b78004bbd51fa06af9eda75972'"
);
}
#[test]
fn settings_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.settings.as_pg_settings(),
"fsync = off\nwal_level = replica\nhot_standby = on\nwal_acceptors = '127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501'\nwal_log_hints = on\nlog_connections = on\nshared_buffers = 32768\nport = 55432\nmax_connections = 100\nmax_wal_senders = 10\nlisten_addresses = '0.0.0.0'\nwal_sender_timeout = 0\npassword_encryption = md5\nmaintenance_work_mem = 65536\nmax_parallel_workers = 8\nmax_worker_processes = 8\nzenith.zenith_tenant = 'b0554b632bd4d547a63b86c3630317e8'\nmax_replication_slots = 10\nzenith.zenith_timeline = '2414a61ffc94e428f14b5758fe308e13'\nshared_preload_libraries = 'zenith'\nsynchronous_standby_names = 'walproposer'\nzenith.page_server_connstring = 'host=127.0.0.1 port=6400'"
);
}
#[test]
fn quote_ident() {
let ident: PgIdent = PgIdent::from("\"name\";\\n select 1;");
assert_eq!(ident.quote(), "\"\"\"name\"\";\\n select 1;\"");
}
}

1
compute_tools/tests/tmp/.gitignore vendored Normal file
View File

@@ -0,0 +1 @@
**/*

View File

@@ -1,30 +1,21 @@
[package]
name = "control_plane"
version = "0.1.0"
authors = ["Stas Kelvich <stas@zenith.tech>"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
edition = "2021"
[dependencies]
rand = "0.8.3"
tar = "0.4.33"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
toml = "0.5"
lazy_static = "1.4"
regex = "1"
anyhow = "1.0"
thiserror = "1"
bytes = "1.0.1"
nix = "0.20"
nix = "0.23"
url = "2.2.2"
hex = { version = "0.4.3", features = ["serde"] }
reqwest = { version = "0.11", features = ["blocking", "json"] }
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
pageserver = { path = "../pageserver" }
walkeeper = { path = "../walkeeper" }
postgres_ffi = { path = "../postgres_ffi" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }

View File

@@ -0,0 +1,20 @@
# Page server and three safekeepers.
[pageserver]
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
auth_type = 'Trust'
[[safekeepers]]
id = 1
pg_port = 5454
http_port = 7676
[[safekeepers]]
id = 2
pg_port = 5455
http_port = 7677
[[safekeepers]]
id = 3
pg_port = 5456
http_port = 7678

11
control_plane/simple.conf Normal file
View File

@@ -0,0 +1,11 @@
# Minimal zenith environment with one safekeeper. This is equivalent to the built-in
# defaults that you get with no --config
[pageserver]
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
auth_type = 'Trust'
[[safekeepers]]
id = 1
pg_port = 5454
http_port = 7676

View File

@@ -1,17 +1,16 @@
use std::fs::{self, File, OpenOptions};
use std::collections::BTreeMap;
use std::fs::{self, File};
use std::io::Write;
use std::net::SocketAddr;
use std::net::TcpStream;
use std::os::unix::fs::PermissionsExt;
use std::path::PathBuf;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use std::{collections::BTreeMap, path::PathBuf};
use anyhow::{Context, Result};
use lazy_static::lazy_static;
use regex::Regex;
use zenith_utils::connstring::connection_host_port;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
@@ -19,6 +18,7 @@ use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
use crate::storage::PageServerNode;
//
@@ -39,8 +39,6 @@ impl ComputeControlPlane {
// | |- <tenant_id>
// | | |- <branch name>
pub fn load(env: LocalEnv) -> Result<ComputeControlPlane> {
// TODO: since pageserver do not have config file yet we believe here that
// it is running on default port. Change that when pageserver will have config.
let pageserver = Arc::new(PageServerNode::from_env(&env));
let mut nodes = BTreeMap::default();
@@ -75,40 +73,55 @@ impl ComputeControlPlane {
.unwrap_or(self.base_port)
}
pub fn local(local_env: &LocalEnv, pageserver: &Arc<PageServerNode>) -> ComputeControlPlane {
ComputeControlPlane {
base_port: 65431,
pageserver: Arc::clone(pageserver),
nodes: BTreeMap::new(),
env: local_env.clone(),
}
// FIXME: see also parse_point_in_time in branches.rs.
fn parse_point_in_time(
&self,
tenantid: ZTenantId,
s: &str,
) -> Result<(ZTimelineId, Option<Lsn>)> {
let mut strings = s.split('@');
let name = strings.next().unwrap();
let lsn = strings
.next()
.map(Lsn::from_str)
.transpose()
.context("invalid LSN in point-in-time specification")?;
// Resolve the timeline ID, given the human-readable branch name
let timeline_id = self
.pageserver
.branch_get_by_name(&tenantid, name)?
.timeline_id;
Ok((timeline_id, lsn))
}
pub fn new_node(
&mut self,
tenantid: ZTenantId,
branch_name: &str,
name: &str,
timeline_spec: &str,
port: Option<u16>,
) -> Result<Arc<PostgresNode>> {
let timeline_id = self
.pageserver
.branch_get_by_name(&tenantid, branch_name)?
.timeline_id;
// Resolve the human-readable timeline spec into timeline ID and LSN
let (timelineid, lsn) = self.parse_point_in_time(tenantid, timeline_spec)?;
let port = port.unwrap_or_else(|| self.get_port());
let node = Arc::new(PostgresNode {
name: branch_name.to_owned(),
name: name.to_owned(),
address: SocketAddr::new("127.0.0.1".parse().unwrap(), port),
env: self.env.clone(),
pageserver: Arc::clone(&self.pageserver),
is_test: false,
timelineid: timeline_id,
timelineid,
lsn,
tenantid,
uses_wal_proposer: false,
});
node.create_pgdata()?;
node.setup_pg_conf(self.env.auth_type)?;
node.setup_pg_conf(self.env.pageserver.auth_type)?;
self.nodes
.insert((tenantid, node.name.clone()), Arc::clone(&node));
@@ -127,6 +140,7 @@ pub struct PostgresNode {
pageserver: Arc<PageServerNode>,
is_test: bool,
pub timelineid: ZTimelineId,
pub lsn: Option<Lsn>, // if it's a read-only node. None for primary
pub tenantid: ZTenantId,
uses_wal_proposer: bool,
}
@@ -144,76 +158,28 @@ impl PostgresNode {
);
}
lazy_static! {
static ref CONF_PORT_RE: Regex = Regex::new(r"(?m)^\s*port\s*=\s*(\d+)\s*$").unwrap();
static ref CONF_TIMELINE_RE: Regex =
Regex::new(r"(?m)^\s*zenith.zenith_timeline\s*=\s*'(\w+)'\s*$").unwrap();
static ref CONF_TENANT_RE: Regex =
Regex::new(r"(?m)^\s*zenith.zenith_tenant\s*=\s*'(\w+)'\s*$").unwrap();
}
// parse data directory name
let fname = entry.file_name();
let name = fname.to_str().unwrap().to_string();
// find out tcp port in config file
// Read config file into memory
let cfg_path = entry.path().join("postgresql.conf");
let config = fs::read_to_string(cfg_path.clone()).with_context(|| {
format!(
"failed to read config file in {}",
cfg_path.to_str().unwrap()
)
})?;
let cfg_path_str = cfg_path.to_string_lossy();
let mut conf_file = File::open(&cfg_path)
.with_context(|| format!("failed to open config file in {}", cfg_path_str))?;
let conf = PostgresConf::read(&mut conf_file)
.with_context(|| format!("failed to read config file in {}", cfg_path_str))?;
// parse port
let err_msg = format!(
"failed to find port definition in config file {}",
cfg_path.to_str().unwrap()
);
let port: u16 = CONF_PORT_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
// Read a few options from the config file
let context = format!("in config file {}", cfg_path_str);
let port: u16 = conf.parse_field("port", &context)?;
let timelineid: ZTimelineId = conf.parse_field("zenith.zenith_timeline", &context)?;
let tenantid: ZTenantId = conf.parse_field("zenith.zenith_tenant", &context)?;
let uses_wal_proposer = conf.get("wal_acceptors").is_some();
// parse timeline
let err_msg = format!(
"failed to find timeline definition in config file {}",
cfg_path.to_str().unwrap()
);
let timelineid: ZTimelineId = CONF_TIMELINE_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
// parse tenant
let err_msg = format!(
"failed to find tenant definition in config file {}",
cfg_path.to_str().unwrap()
);
let tenantid = CONF_TENANT_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
let uses_wal_proposer = config.contains("wal_acceptors");
// parse recovery_target_lsn, if any
let recovery_target_lsn: Option<Lsn> =
conf.parse_field_optional("recovery_target_lsn", &context)?;
// ok now
Ok(PostgresNode {
@@ -223,22 +189,30 @@ impl PostgresNode {
pageserver: Arc::clone(pageserver),
is_test: false,
timelineid,
lsn: recovery_target_lsn,
tenantid,
uses_wal_proposer,
})
}
fn sync_walkeepers(&self) -> Result<Lsn> {
fn sync_safekeepers(&self, auth_token: &Option<String>) -> Result<Lsn> {
let pg_path = self.env.pg_bin_dir().join("postgres");
let sync_handle = Command::new(pg_path)
.arg("--sync-safekeepers")
let mut cmd = Command::new(&pg_path);
cmd.arg("--sync-safekeepers")
.env_clear()
.env("LD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap())
.env("PGDATA", self.pgdata().to_str().unwrap())
.stdout(Stdio::piped())
// Comment this to avoid capturing stderr (useful if command hangs)
.stderr(Stdio::piped())
.stderr(Stdio::piped());
if let Some(token) = auth_token {
cmd.env("ZENITH_AUTH_TOKEN", token);
}
let sync_handle = cmd
.spawn()
.expect("postgres --sync-safekeepers failed to start");
@@ -253,7 +227,7 @@ impl PostgresNode {
}
let lsn = Lsn::from_str(std::str::from_utf8(&sync_output.stdout)?.trim())?;
println!("Walkeepers synced on {}", lsn);
println!("Safekeepers synced on {}", lsn);
Ok(lsn)
}
@@ -275,16 +249,16 @@ impl PostgresNode {
let mut client = self
.pageserver
.page_server_psql_client()
.with_context(|| "connecting to page server failed")?;
.context("connecting to page server failed")?;
let copyreader = client
.copy_out(sql.as_str())
.with_context(|| "page server 'basebackup' command failed")?;
.context("page server 'basebackup' command failed")?;
// Read the archive directly from the `CopyOutReader`
tar::Archive::new(copyreader)
.unpack(&self.pgdata())
.with_context(|| "extracting page backup failed")?;
.context("extracting base backup failed")?;
Ok(())
}
@@ -308,85 +282,117 @@ impl PostgresNode {
// Connect to a page server, get base backup, and untar it to initialize a
// new data directory
fn setup_pg_conf(&self, auth_type: AuthType) -> Result<()> {
File::create(self.pgdata().join("postgresql.conf").to_str().unwrap())?;
let mut conf = PostgresConf::new();
conf.append("max_wal_senders", "10");
// wal_log_hints is mandatory when running against pageserver (see gh issue#192)
// TODO: is it possible to check wal_log_hints at pageserver side via XLOG_PARAMETER_CHANGE?
self.append_conf(
"postgresql.conf",
&format!(
"max_wal_senders = 10\n\
wal_log_hints = on\n\
max_replication_slots = 10\n\
hot_standby = on\n\
shared_buffers = 1MB\n\
fsync = off\n\
max_connections = 100\n\
wal_sender_timeout = 0\n\
wal_level = replica\n\
zenith.file_cache_size = 4096\n\
zenith.file_cache_path = '/tmp/file.cache'\n\
listen_addresses = '{address}'\n\
port = {port}\n",
address = self.address.ip(),
port = self.address.port()
),
)?;
conf.append("wal_log_hints", "on");
conf.append("max_replication_slots", "10");
conf.append("hot_standby", "on");
conf.append("shared_buffers", "1MB");
conf.append("fsync", "off");
conf.append("max_connections", "100");
conf.append("wal_level", "replica");
// wal_sender_timeout is the maximum time to wait for WAL replication.
// It also defines how often the walreciever will send a feedback message to the wal sender.
conf.append("wal_sender_timeout", "5s");
conf.append("listen_addresses", &self.address.ip().to_string());
conf.append("port", &self.address.port().to_string());
// Never clean up old WAL. TODO: We should use a replication
// slot or something proper, to prevent the compute node
// from removing WAL that hasn't been streamed to the safekeeper or
// page server yet. (gh issue #349)
self.append_conf("postgresql.conf", "wal_keep_size='10TB'\n")?;
conf.append("wal_keep_size", "10TB");
// set up authentication
let password = if let AuthType::ZenithJWT = auth_type {
"$ZENITH_AUTH_TOKEN"
} else {
""
// Configure the node to fetch pages from pageserver
let pageserver_connstr = {
let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
// Set up authentication
//
// $ZENITH_AUTH_TOKEN will be replaced with value from environment
// variable during compute pg startup. It is done this way because
// otherwise user will be able to retrieve the value using SHOW
// command or pg_settings
let password = if let AuthType::ZenithJWT = auth_type {
"$ZENITH_AUTH_TOKEN"
} else {
""
};
// NOTE avoiding spaces in connection string, because it is less error prone if we forward it somewhere.
// Also note that not all parameters are supported here. Because in compute we substitute $ZENITH_AUTH_TOKEN
// We parse this string and build it back with token from env var, and for simplicity rebuild
// uses only needed variables namely host, port, user, password.
format!("postgresql://no_user:{}@{}:{}", password, host, port)
};
conf.append("shared_preload_libraries", "zenith");
conf.append_line("");
conf.append("zenith.page_server_connstring", &pageserver_connstr);
conf.append("zenith.zenith_tenant", &self.tenantid.to_string());
conf.append("zenith.zenith_timeline", &self.timelineid.to_string());
if let Some(lsn) = self.lsn {
conf.append("recovery_target_lsn", &lsn.to_string());
}
// Configure that node to take pages from pageserver
let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
self.append_conf(
"postgresql.conf",
format!(
concat!(
"shared_preload_libraries = zenith\n",
// $ZENITH_AUTH_TOKEN will be replaced with value from environment variable during compute pg startup
// it is done this way because otherwise user will be able to retrieve the value using SHOW command or pg_settings
"zenith.page_server_connstring = 'host={} port={} password={}'\n",
"zenith.zenith_timeline='{}'\n",
"zenith.zenith_tenant='{}'\n",
),
host, port, password, self.timelineid, self.tenantid,
)
.as_str(),
)?;
conf.append_line("");
// Configure backpressure
// - Replication write lag depends on how fast the walreceiver can process incoming WAL.
// This lag determines latency of get_page_at_lsn. Speed of applying WAL is about 10MB/sec,
// so to avoid expiration of 1 minute timeout, this lag should not be larger than 600MB.
// Actually latency should be much smaller (better if < 1sec). But we assume that recently
// updates pages are not requested from pageserver.
// - Replication flush lag depends on speed of persisting data by checkpointer (creation of
// delta/image layers) and advancing disk_consistent_lsn. Safekeepers are able to
// remove/archive WAL only beyond disk_consistent_lsn. Too large a lag can cause long
// recovery time (in case of pageserver crash) and disk space overflow at safekeepers.
// - Replication apply lag depends on speed of uploading changes to S3 by uploader thread.
// To be able to restore database in case of pageserver node crash, safekeeper should not
// remove WAL beyond this point. Too large lag can cause space exhaustion in safekeepers
// (if they are not able to upload WAL to S3).
conf.append("max_replication_write_lag", "500MB");
conf.append("max_replication_flush_lag", "10GB");
// Configure the node to stream WAL directly to the pageserver
self.append_conf(
"postgresql.conf",
format!(
concat!(
"synchronous_standby_names = 'pageserver'\n", // TODO: add a new function arg?
"zenith.callmemaybe_connstring = '{}'\n", // FIXME escaping
),
self.connstr(),
)
.as_str(),
)?;
if !self.env.safekeepers.is_empty() {
// Configure the node to connect to the safekeepers
conf.append("synchronous_standby_names", "walproposer");
let wal_acceptors = self
.env
.safekeepers
.iter()
.map(|sk| format!("localhost:{}", sk.pg_port))
.collect::<Vec<String>>()
.join(",");
conf.append("wal_acceptors", &wal_acceptors);
} else {
// We only use setup without safekeepers for tests,
// and don't care about data durability on pageserver,
// so set more relaxed synchronous_commit.
conf.append("synchronous_commit", "remote_write");
// Configure the node to stream WAL directly to the pageserver
// This isn't really a supported configuration, but can be useful for
// testing.
conf.append("synchronous_standby_names", "pageserver");
conf.append("zenith.callmemaybe_connstring", &self.connstr());
}
let mut file = File::create(self.pgdata().join("postgresql.conf"))?;
file.write_all(conf.to_string().as_bytes())?;
Ok(())
}
fn load_basebackup(&self) -> Result<()> {
let lsn = if self.uses_wal_proposer {
fn load_basebackup(&self, auth_token: &Option<String>) -> Result<()> {
let backup_lsn = if let Some(lsn) = self.lsn {
Some(lsn)
} else if self.uses_wal_proposer {
// LSN 0 means that it is bootstrap and we need to download just
// latest data from the pageserver. That is a bit clumsy but whole bootstrap
// procedure evolves quite actively right now, so let's think about it again
// when things would be more stable (TODO).
let lsn = self.sync_walkeepers()?;
let lsn = self.sync_safekeepers(auth_token)?;
if lsn == Lsn(0) {
None
} else {
@@ -396,7 +402,7 @@ impl PostgresNode {
None
};
self.do_basebackup(lsn)?;
self.do_basebackup(backup_lsn)?;
Ok(())
}
@@ -418,14 +424,6 @@ impl PostgresNode {
}
}
pub fn append_conf(&self, config: &str, opts: &str) -> Result<()> {
OpenOptions::new()
.append(true)
.open(self.pgdata().join(config).to_str().unwrap())?
.write_all(opts.as_bytes())?;
Ok(())
}
fn pg_ctl(&self, args: &[&str], auth_token: &Option<String>) -> Result<()> {
let pg_ctl_path = self.env.pg_bin_dir().join("pg_ctl");
let mut cmd = Command::new(pg_ctl_path);
@@ -445,11 +443,10 @@ impl PostgresNode {
.env_clear()
.env("LD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap());
if let Some(token) = auth_token {
cmd.env("ZENITH_AUTH_TOKEN", token);
}
let pg_ctl = cmd.status().with_context(|| "pg_ctl failed")?;
let pg_ctl = cmd.status().context("pg_ctl failed")?;
if !pg_ctl.success() {
anyhow::bail!("pg_ctl failed");
@@ -479,7 +476,11 @@ impl PostgresNode {
fs::write(&postgresql_conf_path, postgresql_conf)?;
// 3. Load basebackup
self.load_basebackup()?;
self.load_basebackup(auth_token)?;
if self.lsn.is_some() {
File::create(self.pgdata().join("standby.signal"))?;
}
// 4. Finally start the compute node postgres
println!("Starting postgres node at '{}'", self.connstr());
@@ -527,9 +528,7 @@ impl PostgresNode {
.output()
.expect("failed to execute whoami");
if !output.status.success() {
panic!("whoami failed");
}
assert!(output.status.success(), "whoami failed");
String::from_utf8(output.stdout).unwrap().trim().to_string()
}

View File

@@ -9,9 +9,12 @@
use anyhow::{anyhow, bail, Context, Result};
use std::fs;
use std::path::Path;
use std::process::Command;
pub mod compute;
pub mod local_env;
pub mod postgresql_conf;
pub mod safekeeper;
pub mod storage;
/// Read a PID file
@@ -29,3 +32,19 @@ pub fn read_pidfile(pidfile: &Path) -> Result<i32> {
}
Ok(pid)
}
fn fill_rust_env_vars(cmd: &mut Command) -> &mut Command {
let cmd = cmd.env_clear().env("RUST_BACKTRACE", "1");
let var = "LLVM_PROFILE_FILE";
if let Some(val) = std::env::var_os(var) {
cmd.env(var, val);
}
const RUST_LOG_KEY: &str = "RUST_LOG";
if let Ok(rust_log_value) = std::env::var(RUST_LOG_KEY) {
cmd.env(RUST_LOG_KEY, rust_log_value)
} else {
cmd
}
}

View File

@@ -1,52 +1,112 @@
//
// This module is responsible for locating and loading paths in a local setup.
//
// Now it also provides init method which acts like a stub for proper installation
// script which will use local paths.
//
use anyhow::{Context, Result};
//! This module is responsible for locating and loading paths in a local setup.
//!
//! Now it also provides init method which acts like a stub for proper installation
//! script which will use local paths.
use anyhow::{bail, Context};
use serde::{Deserialize, Serialize};
use std::env;
use std::fmt::Write;
use std::fs;
use std::path::PathBuf;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use zenith_utils::auth::{encode_from_key_path, Claims, Scope};
use zenith_utils::auth::{encode_from_key_file, Claims, Scope};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::{opt_display_serde, ZNodeId, ZTenantId};
use crate::safekeeper::SafekeeperNode;
//
// This data structures represent deserialized zenith CLI config
// This data structures represents zenith CLI config
//
// It is deserialized from the .zenith/config file, or the config file passed
// to 'zenith init --config=<path>' option. See control_plane/simple.conf for
// an example.
//
#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct LocalEnv {
// Pageserver connection settings
pub pageserver_pg_port: u16,
pub pageserver_http_port: u16,
// Base directory for both pageserver and compute nodes
// Base directory for all the nodes (the pageserver, safekeepers and
// compute nodes).
//
// This is not stored in the config file. Rather, this is the path where the
// config file itself is. It is read from the ZENITH_REPO_DIR env variable or
// '.zenith' if not given.
#[serde(skip)]
pub base_data_dir: PathBuf,
// Path to postgres distribution. It's expected that "bin", "include",
// "lib", "share" from postgres distribution are there. If at some point
// in time we will be able to run against vanilla postgres we may split that
// to four separate paths and match OS-specific installation layout.
#[serde(default)]
pub pg_distrib_dir: PathBuf,
// Path to pageserver binary.
#[serde(default)]
pub zenith_distrib_dir: PathBuf,
// keeping tenant id in config to reduce copy paste when running zenith locally with single tenant
#[serde(with = "hex")]
pub tenantid: ZTenantId,
// Default tenant ID to use with the 'zenith' command line utility, when
// --tenantid is not explicitly specified.
#[serde(with = "opt_display_serde")]
#[serde(default)]
pub default_tenantid: Option<ZTenantId>,
// jwt auth token used for communication with pageserver
pub auth_token: String,
// used to issue tokens during e.g pg start
#[serde(default)]
pub private_key_path: PathBuf,
pub pageserver: PageServerConf,
#[serde(default)]
pub safekeepers: Vec<SafekeeperConf>,
}
#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(default)]
pub struct PageServerConf {
// node id
pub id: ZNodeId,
// Pageserver connection settings
pub listen_pg_addr: String,
pub listen_http_addr: String,
// used to determine which auth type is used
pub auth_type: AuthType,
// used to issue tokens during e.g pg start
pub private_key_path: PathBuf,
// jwt auth token used for communication with pageserver
pub auth_token: String,
}
impl Default for PageServerConf {
fn default() -> Self {
Self {
id: ZNodeId(0),
listen_pg_addr: String::new(),
listen_http_addr: String::new(),
auth_type: AuthType::Trust,
auth_token: String::new(),
}
}
}
#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(default)]
pub struct SafekeeperConf {
pub id: ZNodeId,
pub pg_port: u16,
pub http_port: u16,
pub sync: bool,
}
impl Default for SafekeeperConf {
fn default() -> Self {
Self {
id: ZNodeId(0),
pg_port: 0,
http_port: 0,
sync: true,
}
}
}
impl LocalEnv {
@@ -58,10 +118,14 @@ impl LocalEnv {
self.pg_distrib_dir.join("lib")
}
pub fn pageserver_bin(&self) -> Result<PathBuf> {
pub fn pageserver_bin(&self) -> anyhow::Result<PathBuf> {
Ok(self.zenith_distrib_dir.join("pageserver"))
}
pub fn safekeeper_bin(&self) -> anyhow::Result<PathBuf> {
Ok(self.zenith_distrib_dir.join("safekeeper"))
}
pub fn pg_data_dirs_path(&self) -> PathBuf {
self.base_data_dir.join("pgdatadirs").join("tenants")
}
@@ -76,6 +140,190 @@ impl LocalEnv {
pub fn pageserver_data_dir(&self) -> PathBuf {
self.base_data_dir.clone()
}
pub fn safekeeper_data_dir(&self, data_dir_name: &str) -> PathBuf {
self.base_data_dir.join("safekeepers").join(data_dir_name)
}
/// Create a LocalEnv from a config file.
///
/// Unlike 'load_config', this function fills in any defaults that are missing
/// from the config file.
pub fn create_config(toml: &str) -> anyhow::Result<Self> {
let mut env: LocalEnv = toml::from_str(toml)?;
// Find postgres binaries.
// Follow POSTGRES_DISTRIB_DIR if set, otherwise look in "tmp_install".
if env.pg_distrib_dir == Path::new("") {
if let Some(postgres_bin) = env::var_os("POSTGRES_DISTRIB_DIR") {
env.pg_distrib_dir = postgres_bin.into();
} else {
let cwd = env::current_dir()?;
env.pg_distrib_dir = cwd.join("tmp_install")
}
}
if !env.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
env.pg_distrib_dir.display()
);
}
// Find zenith binaries.
if env.zenith_distrib_dir == Path::new("") {
env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
for binary in ["pageserver", "safekeeper"] {
if !env.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{}' in zenith distrib dir '{}'",
binary,
env.zenith_distrib_dir.display()
);
}
}
// If no initial tenant ID was given, generate it.
if env.default_tenantid.is_none() {
env.default_tenantid = Some(ZTenantId::generate());
}
env.base_data_dir = base_path();
Ok(env)
}
/// Locate and load config
pub fn load_config() -> anyhow::Result<Self> {
let repopath = base_path();
if !repopath.exists() {
bail!(
"Zenith config is not found in {}. You need to run 'zenith init' first",
repopath.to_str().unwrap()
);
}
// TODO: check that it looks like a zenith repository
// load and parse file
let config = fs::read_to_string(repopath.join("config"))?;
let mut env: LocalEnv = toml::from_str(config.as_str())?;
env.base_data_dir = repopath;
Ok(env)
}
// this function is used only for testing purposes in CLI e g generate tokens during init
pub fn generate_auth_token(&self, claims: &Claims) -> anyhow::Result<String> {
let private_key_path = if self.private_key_path.is_absolute() {
self.private_key_path.to_path_buf()
} else {
self.base_data_dir.join(&self.private_key_path)
};
let key_data = fs::read(private_key_path)?;
encode_from_key_file(claims, &key_data)
}
//
// Initialize a new Zenith repository
//
pub fn init(&mut self) -> anyhow::Result<()> {
// check if config already exists
let base_path = &self.base_data_dir;
if base_path == Path::new("") {
bail!("repository base path is missing");
}
if base_path.exists() {
bail!(
"directory '{}' already exists. Perhaps already initialized?",
base_path.to_str().unwrap()
);
}
fs::create_dir(&base_path)?;
// generate keys for jwt
// openssl genrsa -out private_key.pem 2048
let private_key_path;
if self.private_key_path == PathBuf::new() {
private_key_path = base_path.join("auth_private_key.pem");
let keygen_output = Command::new("openssl")
.arg("genrsa")
.args(&["-out", private_key_path.to_str().unwrap()])
.arg("2048")
.stdout(Stdio::null())
.output()
.context("failed to generate auth private key")?;
if !keygen_output.status.success() {
bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
self.private_key_path = PathBuf::from("auth_private_key.pem");
let public_key_path = base_path.join("auth_public_key.pem");
// openssl rsa -in private_key.pem -pubout -outform PEM -out public_key.pem
let keygen_output = Command::new("openssl")
.arg("rsa")
.args(&["-in", private_key_path.to_str().unwrap()])
.arg("-pubout")
.args(&["-outform", "PEM"])
.args(&["-out", public_key_path.to_str().unwrap()])
.stdout(Stdio::null())
.output()
.context("failed to generate auth private key")?;
if !keygen_output.status.success() {
bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
}
self.pageserver.auth_token =
self.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
fs::create_dir_all(self.pg_data_dirs_path())?;
for safekeeper in &self.safekeepers {
fs::create_dir_all(SafekeeperNode::datadir_path_by_id(self, safekeeper.id))?;
}
let mut conf_content = String::new();
// Currently, the user first passes a config file with 'zenith init --config=<path>'
// We read that in, in `create_config`, and fill any missing defaults. Then it's saved
// to .zenith/config. TODO: We lose any formatting and comments along the way, which is
// a bit sad.
write!(
&mut conf_content,
r#"# This file describes a locale deployment of the page server
# and safekeeeper node. It is read by the 'zenith' command-line
# utility.
"#
)?;
// Convert the LocalEnv to a toml file.
//
// This could be as simple as this:
//
// conf_content += &toml::to_string_pretty(env)?;
//
// But it results in a "values must be emitted before tables". I'm not sure
// why, AFAICS the table, i.e. 'safekeepers: Vec<SafekeeperConf>' is last.
// Maybe rust reorders the fields to squeeze avoid padding or something?
// In any case, converting to toml::Value first, and serializing that, works.
// See https://github.com/alexcrichton/toml-rs/issues/142
conf_content += &toml::to_string_pretty(&toml::Value::try_from(&self)?)?;
fs::write(base_path.join("config"), conf_content)?;
Ok(())
}
}
fn base_path() -> PathBuf {
@@ -84,119 +332,3 @@ fn base_path() -> PathBuf {
None => ".zenith".into(),
}
}
//
// Initialize a new Zenith repository
//
pub fn init(
pageserver_pg_port: u16,
pageserver_http_port: u16,
tenantid: ZTenantId,
auth_type: AuthType,
) -> Result<()> {
// check if config already exists
let base_path = base_path();
if base_path.exists() {
anyhow::bail!(
"{} already exists. Perhaps already initialized?",
base_path.to_str().unwrap()
);
}
fs::create_dir(&base_path)?;
// ok, now check that expected binaries are present
// Find postgres binaries. Follow POSTGRES_DISTRIB_DIR if set, otherwise look in "tmp_install".
let pg_distrib_dir: PathBuf = {
if let Some(postgres_bin) = env::var_os("POSTGRES_DISTRIB_DIR") {
postgres_bin.into()
} else {
let cwd = env::current_dir()?;
cwd.join("tmp_install")
}
};
if !pg_distrib_dir.join("bin/postgres").exists() {
anyhow::bail!("Can't find postgres binary at {:?}", pg_distrib_dir);
}
// generate keys for jwt
// openssl genrsa -out private_key.pem 2048
let private_key_path = base_path.join("auth_private_key.pem");
let keygen_output = Command::new("openssl")
.arg("genrsa")
.args(&["-out", private_key_path.to_str().unwrap()])
.arg("2048")
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
let public_key_path = base_path.join("auth_public_key.pem");
// openssl rsa -in private_key.pem -pubout -outform PEM -out public_key.pem
let keygen_output = Command::new("openssl")
.arg("rsa")
.args(&["-in", private_key_path.to_str().unwrap()])
.arg("-pubout")
.args(&["-outform", "PEM"])
.args(&["-out", public_key_path.to_str().unwrap()])
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
let auth_token =
encode_from_key_path(&Claims::new(None, Scope::PageServerApi), &private_key_path)?;
// Find zenith binaries.
let zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
if !zenith_distrib_dir.join("pageserver").exists() {
anyhow::bail!("Can't find pageserver binary.",);
}
let conf = LocalEnv {
pageserver_pg_port,
pageserver_http_port,
pg_distrib_dir,
zenith_distrib_dir,
base_data_dir: base_path,
tenantid,
auth_token,
auth_type,
private_key_path,
};
fs::create_dir_all(conf.pg_data_dirs_path())?;
let toml = toml::to_string_pretty(&conf)?;
fs::write(conf.base_data_dir.join("config"), toml)?;
Ok(())
}
// Locate and load config
pub fn load_config() -> Result<LocalEnv> {
let repopath = base_path();
if !repopath.exists() {
anyhow::bail!(
"Zenith config is not found in {}. You need to run 'zenith init' first",
repopath.to_str().unwrap()
);
}
// TODO: check that it looks like a zenith repository
// load and parse file
let config = fs::read_to_string(repopath.join("config"))?;
toml::from_str(config.as_str()).map_err(|e| e.into())
}

View File

@@ -0,0 +1,228 @@
///
/// Module for parsing postgresql.conf file.
///
/// NOTE: This doesn't implement the full, correct postgresql.conf syntax. Just
/// enough to extract a few settings we need in Zenith, assuming you don't do
/// funny stuff like include-directives or funny escaping.
use anyhow::{bail, Context, Result};
use lazy_static::lazy_static;
use regex::Regex;
use std::collections::HashMap;
use std::fmt;
use std::io::BufRead;
use std::str::FromStr;
/// In-memory representation of a postgresql.conf file
#[derive(Default)]
pub struct PostgresConf {
lines: Vec<String>,
hash: HashMap<String, String>,
}
lazy_static! {
static ref CONF_LINE_RE: Regex = Regex::new(r"^((?:\w|\.)+)\s*=\s*(\S+)$").unwrap();
}
impl PostgresConf {
pub fn new() -> PostgresConf {
PostgresConf::default()
}
/// Read file into memory
pub fn read(read: impl std::io::Read) -> Result<PostgresConf> {
let mut result = Self::new();
for line in std::io::BufReader::new(read).lines() {
let line = line?;
// Store each line in a vector, in original format
result.lines.push(line.clone());
// Also parse each line and insert key=value lines into a hash map.
//
// FIXME: This doesn't match exactly the flex/bison grammar in PostgreSQL.
// But it's close enough for our usage.
let line = line.trim();
if line.starts_with('#') {
// comment, ignore
continue;
} else if let Some(caps) = CONF_LINE_RE.captures(line) {
let name = caps.get(1).unwrap().as_str();
let raw_val = caps.get(2).unwrap().as_str();
if let Ok(val) = deescape_str(raw_val) {
// Note: if there's already an entry in the hash map for
// this key, this will replace it. That's the behavior what
// we want; when PostgreSQL reads the file, each line
// overrides any previous value for the same setting.
result.hash.insert(name.to_string(), val.to_string());
}
}
}
Ok(result)
}
/// Return the current value of 'option'
pub fn get(&self, option: &str) -> Option<&str> {
self.hash.get(option).map(|x| x.as_ref())
}
/// Return the current value of a field, parsed to the right datatype.
///
/// This calls the FromStr::parse() function on the value of the field. If
/// the field does not exist, or parsing fails, returns an error.
///
pub fn parse_field<T>(&self, field_name: &str, context: &str) -> Result<T>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
self.get(field_name)
.with_context(|| format!("could not find '{}' option {}", field_name, context))?
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))
}
pub fn parse_field_optional<T>(&self, field_name: &str, context: &str) -> Result<Option<T>>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
if let Some(val) = self.get(field_name) {
let result = val
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))?;
Ok(Some(result))
} else {
Ok(None)
}
}
///
/// Note: if you call this multiple times for the same option, the config
/// file will a line for each call. It would be nice to have a function
/// to change an existing line, but that's a TODO.
///
pub fn append(&mut self, option: &str, value: &str) {
self.lines
.push(format!("{}={}\n", option, escape_str(value)));
self.hash.insert(option.to_string(), value.to_string());
}
/// Append an arbitrary non-setting line to the config file
pub fn append_line(&mut self, line: &str) {
self.lines.push(line.to_string());
}
}
impl fmt::Display for PostgresConf {
/// Return the whole configuration file as a string
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
for line in self.lines.iter() {
f.write_str(line)?;
}
Ok(())
}
}
/// Escape a value for putting in postgresql.conf.
fn escape_str(s: &str) -> String {
// If the string doesn't contain anything that needs quoting or escaping, return it
// as it is.
//
// The first part of the regex, before the '|', matches the INTEGER rule in the
// PostgreSQL flex grammar (guc-file.l). It matches plain integers like "123" and
// "-123", and also accepts units like "10MB". The second part of the regex matches
// the UNQUOTED_STRING rule, and accepts strings that contain a single word, beginning
// with a letter. That covers words like "off" or "posix". Everything else is quoted.
//
// This regex is a bit more conservative than the rules in guc-file.l, so we quote some
// strings that PostgreSQL would accept without quoting, but that's OK.
lazy_static! {
static ref UNQUOTED_RE: Regex =
Regex::new(r"(^[-+]?[0-9]+[a-zA-Z]*$)|(^[a-zA-Z][a-zA-Z0-9]*$)").unwrap();
}
if UNQUOTED_RE.is_match(s) {
s.to_string()
} else {
// Otherwise escape and quote it
let s = s
.replace('\\', "\\\\")
.replace('\n', "\\n")
.replace('\'', "''");
"\'".to_owned() + &s + "\'"
}
}
/// De-escape a possibly-quoted value.
///
/// See `DeescapeQuotedString` function in PostgreSQL sources for how PostgreSQL
/// does this.
fn deescape_str(s: &str) -> Result<String> {
// If the string has a quote at the beginning and end, strip them out.
if s.len() >= 2 && s.starts_with('\'') && s.ends_with('\'') {
let mut result = String::new();
let mut iter = s[1..(s.len() - 1)].chars().peekable();
while let Some(c) = iter.next() {
let newc = if c == '\\' {
match iter.next() {
Some('b') => '\x08',
Some('f') => '\x0c',
Some('n') => '\n',
Some('r') => '\r',
Some('t') => '\t',
Some('0'..='7') => {
// TODO
bail!("octal escapes not supported");
}
Some(n) => n,
None => break,
}
} else if c == '\'' && iter.peek() == Some(&'\'') {
// doubled quote becomes just one quote
iter.next().unwrap()
} else {
c
};
result.push(newc);
}
Ok(result)
} else {
Ok(s.to_string())
}
}
#[test]
fn test_postgresql_conf_escapes() -> Result<()> {
assert_eq!(escape_str("foo bar"), "'foo bar'");
// these don't need to be quoted
assert_eq!(escape_str("foo"), "foo");
assert_eq!(escape_str("123"), "123");
assert_eq!(escape_str("+123"), "+123");
assert_eq!(escape_str("-10"), "-10");
assert_eq!(escape_str("1foo"), "1foo");
assert_eq!(escape_str("foo1"), "foo1");
assert_eq!(escape_str("10MB"), "10MB");
assert_eq!(escape_str("-10kB"), "-10kB");
// these need quoting and/or escaping
assert_eq!(escape_str("foo bar"), "'foo bar'");
assert_eq!(escape_str("fo'o"), "'fo''o'");
assert_eq!(escape_str("fo\no"), "'fo\\no'");
assert_eq!(escape_str("fo\\o"), "'fo\\\\o'");
assert_eq!(escape_str("10 cats"), "'10 cats'");
// Test de-escaping
assert_eq!(deescape_str(&escape_str("foo"))?, "foo");
assert_eq!(deescape_str(&escape_str("fo'o\nba\\r"))?, "fo'o\nba\\r");
assert_eq!(deescape_str("'\\b\\f\\n\\r\\t'")?, "\x08\x0c\n\r\t");
// octal-escapes are currently not supported
assert!(deescape_str("'foo\\7\\07\\007'").is_err());
Ok(())
}

View File

@@ -0,0 +1,264 @@
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
use std::process::Command;
use std::sync::Arc;
use std::time::Duration;
use std::{io, result, thread};
use anyhow::bail;
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use postgres::Config;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::zid::ZNodeId;
use crate::local_env::{LocalEnv, SafekeeperConf};
use crate::storage::PageServerNode;
use crate::{fill_rust_env_vars, read_pidfile};
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
pub enum SafekeeperHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
type Result<T> = result::Result<T, SafekeeperHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest do not export it's error construction utility functions, so lets craft the message ourselves
let url = self.url().to_owned();
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
//
// Control routines for safekeeper.
//
// Used in CLI and tests.
//
#[derive(Debug)]
pub struct SafekeeperNode {
pub id: ZNodeId,
pub conf: SafekeeperConf,
pub pg_connection_config: Config,
pub env: LocalEnv,
pub http_client: Client,
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
}
impl SafekeeperNode {
pub fn from_env(env: &LocalEnv, conf: &SafekeeperConf) -> SafekeeperNode {
let pageserver = Arc::new(PageServerNode::from_env(env));
println!("initializing for sk {} for {}", conf.id, conf.http_port);
SafekeeperNode {
id: conf.id,
conf: conf.clone(),
pg_connection_config: Self::safekeeper_connection_config(conf.pg_port),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
pageserver,
}
}
/// Construct libpq connection string for connecting to this safekeeper.
fn safekeeper_connection_config(port: u16) -> Config {
// TODO safekeeper authentication not implemented yet
format!("postgresql://no_user@127.0.0.1:{}/no_db", port)
.parse()
.unwrap()
}
pub fn datadir_path_by_id(env: &LocalEnv, sk_id: ZNodeId) -> PathBuf {
env.safekeeper_data_dir(format!("sk{}", sk_id).as_ref())
}
pub fn datadir_path(&self) -> PathBuf {
SafekeeperNode::datadir_path_by_id(&self.env, self.id)
}
pub fn pid_file(&self) -> PathBuf {
self.datadir_path().join("safekeeper.pid")
}
pub fn start(&self) -> anyhow::Result<()> {
print!(
"Starting safekeeper at '{}' in '{}'",
connection_address(&self.pg_connection_config),
self.datadir_path().display()
);
io::stdout().flush().unwrap();
let listen_pg = format!("127.0.0.1:{}", self.conf.pg_port);
let listen_http = format!("127.0.0.1:{}", self.conf.http_port);
let mut cmd = Command::new(self.env.safekeeper_bin()?);
fill_rust_env_vars(
cmd.args(&["-D", self.datadir_path().to_str().unwrap()])
.args(&["--id", self.id.to_string().as_ref()])
.args(&["--listen-pg", &listen_pg])
.args(&["--listen-http", &listen_http])
.args(&["--recall", "1 second"])
.arg("--daemonize"),
);
if !self.conf.sync {
cmd.arg("--no-sync");
}
if !cmd.status()?.success() {
bail!(
"Safekeeper failed to start. See '{}' for details.",
self.datadir_path().join("safekeeper.log").display()
);
}
// It takes a while for the safekeeper to start up. Wait until it is
// open for business.
const RETRIES: i8 = 15;
for retries in 1..RETRIES {
match self.check_status() {
Ok(_) => {
println!("\nSafekeeper started");
return Ok(());
}
Err(err) => {
match err {
SafekeeperHttpError::Transport(err) => {
if err.is_connect() && retries < 5 {
print!(".");
io::stdout().flush().unwrap();
} else {
if retries == 5 {
println!() // put a line break after dots for second message
}
println!(
"Safekeeper not responding yet, err {} retrying ({})...",
err, retries
);
}
}
SafekeeperHttpError::Response(msg) => {
bail!("safekeeper failed to start: {} ", msg)
}
}
thread::sleep(Duration::from_secs(1));
}
}
}
bail!("safekeeper failed to start in {} seconds", RETRIES);
}
///
/// Stop the server.
///
/// If 'immediate' is true, we use SIGQUIT, killing the process immediately.
/// Otherwise we use SIGTERM, triggering a clean shutdown
///
/// If the server is not running, returns success
///
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
let pid_file = self.pid_file();
if !pid_file.exists() {
println!("Safekeeper {} is already stopped", self.id);
return Ok(());
}
let pid = read_pidfile(&pid_file)?;
let pid = Pid::from_raw(pid);
let sig = if immediate {
println!("Stop safekeeper immediately");
Signal::SIGQUIT
} else {
println!("Stop safekeeper gracefully");
Signal::SIGTERM
};
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
println!(
"Safekeeper with pid {} does not exist, but a PID file was found",
pid
);
return Ok(());
}
Err(err) => bail!(
"Failed to send signal to safekeeper with pid {}: {}",
pid,
err.desc()
),
}
let address = connection_address(&self.pg_connection_config);
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if safekeeper flushes a lot of data
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Safekeeper stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Safekeeper status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Safekeeper status is: {}", err);
return Ok(());
}
}
} else {
println!("Safekeeper still receives connections");
thread::sleep(Duration::from_secs(1));
}
}
bail!("Failed to stop safekeeper with pid {}", pid);
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
// TODO: authentication
//if self.env.auth_type == AuthType::ZenithJWT {
// builder = builder.bearer_auth(&self.env.safekeeper_auth_token)
//}
self.http_client.request(method, url)
}
pub fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()?
.error_from_body()?;
Ok(())
}
}

View File

@@ -5,7 +5,8 @@ use std::process::Command;
use std::time::Duration;
use std::{io, result, thread};
use anyhow::{anyhow, bail};
use anyhow::bail;
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use pageserver::http::models::{BranchCreateRequest, TenantCreateRequest};
@@ -18,8 +19,9 @@ use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
use crate::local_env::LocalEnv;
use crate::read_pidfile;
use crate::{fill_rust_env_vars, read_pidfile};
use pageserver::branches::BranchInfo;
use pageserver::tenant_mgr::TenantInfo;
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
@@ -62,7 +64,6 @@ impl ResponseErrorMessageExt for Response {
//
#[derive(Debug)]
pub struct PageServerNode {
pub kill_on_exit: bool,
pub pg_connection_config: Config,
pub env: LocalEnv,
pub http_client: Client,
@@ -71,67 +72,84 @@ pub struct PageServerNode {
impl PageServerNode {
pub fn from_env(env: &LocalEnv) -> PageServerNode {
let password = if env.auth_type == AuthType::ZenithJWT {
&env.auth_token
let password = if env.pageserver.auth_type == AuthType::ZenithJWT {
&env.pageserver.auth_token
} else {
""
};
PageServerNode {
kill_on_exit: false,
Self {
pg_connection_config: Self::pageserver_connection_config(
password,
env.pageserver_pg_port,
&env.pageserver.listen_pg_addr,
),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://localhost:{}/v1", env.pageserver_http_port),
http_base_url: format!("http://{}/v1", env.pageserver.listen_http_addr),
}
}
fn pageserver_connection_config(password: &str, port: u16) -> Config {
format!("postgresql://no_user:{}@localhost:{}/no_db", password, port)
/// Construct libpq connection string for connecting to the pageserver.
fn pageserver_connection_config(password: &str, listen_addr: &str) -> Config {
format!("postgresql://no_user:{}@{}/no_db", password, listen_addr)
.parse()
.unwrap()
}
pub fn init(&self, create_tenant: Option<&str>, enable_auth: bool) -> anyhow::Result<()> {
pub fn init(
&self,
create_tenant: Option<&str>,
config_overrides: &[&str],
) -> anyhow::Result<()> {
let mut cmd = Command::new(self.env.pageserver_bin()?);
let listen_pg = format!("localhost:{}", self.env.pageserver_pg_port);
let listen_http = format!("localhost:{}", self.env.pageserver_http_port);
let mut args = vec![
"--init",
"-D",
self.env.base_data_dir.to_str().unwrap(),
"--postgres-distrib",
self.env.pg_distrib_dir.to_str().unwrap(),
"--listen-pg",
&listen_pg,
"--listen-http",
&listen_http,
];
if enable_auth {
args.extend(&["--auth-validation-public-key-path", "auth_public_key.pem"]);
args.extend(&["--auth-type", "ZenithJWT"]);
let id = format!("id={}", self.env.pageserver.id);
// FIXME: the paths should be shell-escaped to handle paths with spaces, quotas etc.
let base_data_dir_param = self.env.base_data_dir.display().to_string();
let pg_distrib_dir_param =
format!("pg_distrib_dir='{}'", self.env.pg_distrib_dir.display());
let authg_type_param = format!("auth_type='{}'", self.env.pageserver.auth_type);
let listen_http_addr_param = format!(
"listen_http_addr='{}'",
self.env.pageserver.listen_http_addr
);
let listen_pg_addr_param =
format!("listen_pg_addr='{}'", self.env.pageserver.listen_pg_addr);
let mut args = Vec::with_capacity(20);
args.push("--init");
args.extend(["-D", &base_data_dir_param]);
args.extend(["-c", &pg_distrib_dir_param]);
args.extend(["-c", &authg_type_param]);
args.extend(["-c", &listen_http_addr_param]);
args.extend(["-c", &listen_pg_addr_param]);
args.extend(["-c", &id]);
for config_override in config_overrides {
args.extend(["-c", config_override]);
}
if self.env.pageserver.auth_type != AuthType::Trust {
args.extend([
"-c",
"auth_validation_public_key_path='auth_public_key.pem'",
]);
}
if let Some(tenantid) = create_tenant {
args.extend(&["--create-tenant", tenantid])
args.extend(["--create-tenant", tenantid])
}
let status = cmd
.args(args)
.env_clear()
.env("RUST_BACKTRACE", "1")
let status = fill_rust_env_vars(cmd.args(args))
.status()
.expect("pageserver init failed");
if status.success() {
Ok(())
} else {
Err(anyhow!("pageserver init failed"))
if !status.success() {
bail!("pageserver init failed");
}
Ok(())
}
pub fn repo_path(&self) -> PathBuf {
@@ -142,7 +160,7 @@ impl PageServerNode {
self.repo_path().join("pageserver.pid")
}
pub fn start(&self) -> anyhow::Result<()> {
pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
print!(
"Starting pageserver at '{}' in '{}'",
connection_address(&self.pg_connection_config),
@@ -151,10 +169,15 @@ impl PageServerNode {
io::stdout().flush().unwrap();
let mut cmd = Command::new(self.env.pageserver_bin()?);
cmd.args(&["-D", self.repo_path().to_str().unwrap()])
.arg("-d")
.env_clear()
.env("RUST_BACKTRACE", "1");
let repo_path = self.repo_path();
let mut args = vec!["-D", repo_path.to_str().unwrap()];
for config_override in config_overrides {
args.extend(["-c", config_override]);
}
fill_rust_env_vars(cmd.args(&args).arg("--daemonize"));
if !cmd.status()?.success() {
bail!(
@@ -199,23 +222,69 @@ impl PageServerNode {
bail!("pageserver failed to start in {} seconds", RETRIES);
}
pub fn stop(&self) -> anyhow::Result<()> {
let pid = read_pidfile(&self.pid_file())?;
let pid = Pid::from_raw(pid);
if kill(pid, Signal::SIGTERM).is_err() {
bail!("Failed to kill pageserver with pid {}", pid);
///
/// Stop the server.
///
/// If 'immediate' is true, we use SIGQUIT, killing the process immediately.
/// Otherwise we use SIGTERM, triggering a clean shutdown
///
/// If the server is not running, returns success
///
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
let pid_file = self.pid_file();
if !pid_file.exists() {
println!("Pageserver is already stopped");
return Ok(());
}
let pid = Pid::from_raw(read_pidfile(&pid_file)?);
// wait for pageserver stop
let address = connection_address(&self.pg_connection_config);
for _ in 0..5 {
let stream = TcpStream::connect(&address);
thread::sleep(Duration::from_secs(1));
if let Err(_e) = stream {
println!("Pageserver stopped");
let sig = if immediate {
println!("Stop pageserver immediately");
Signal::SIGQUIT
} else {
println!("Stop pageserver gracefully");
Signal::SIGTERM
};
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
println!(
"Pageserver with pid {} does not exist, but a PID file was found",
pid
);
return Ok(());
}
println!("Stopping pageserver on {}", address);
Err(err) => bail!(
"Failed to send signal to pageserver with pid {}: {}",
pid,
err.desc()
),
}
let address = connection_address(&self.pg_connection_config);
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if pageserver checkpoints a lot of data
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Pageserver stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Pageserver status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Pageserver status is: {}", err);
return Ok(());
}
}
} else {
println!("Pageserver still receives connections");
thread::sleep(Duration::from_secs(1));
}
}
bail!("Failed to stop pageserver with pid {}", pid);
@@ -234,8 +303,8 @@ impl PageServerNode {
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
let mut builder = self.http_client.request(method, url);
if self.env.auth_type == AuthType::ZenithJWT {
builder = builder.bearer_auth(&self.env.auth_token)
if self.env.pageserver.auth_type == AuthType::ZenithJWT {
builder = builder.bearer_auth(&self.env.pageserver.auth_token)
}
builder
}
@@ -247,7 +316,7 @@ impl PageServerNode {
Ok(())
}
pub fn tenant_list(&self) -> Result<Vec<String>> {
pub fn tenant_list(&self) -> Result<Vec<TenantInfo>> {
Ok(self
.http_request(Method::GET, format!("{}/{}", self.http_base_url, "tenant"))
.send()?
@@ -310,11 +379,3 @@ impl PageServerNode {
.json()?)
}
}
impl Drop for PageServerNode {
fn drop(&mut self) {
if self.kill_on_exit {
let _ = self.stop();
}
}
}

View File

@@ -4,10 +4,10 @@ set -eux
if [ "$1" = 'pageserver' ]; then
if [ ! -d "/data/tenants" ]; then
echo "Initializing pageserver data directory"
pageserver --init -D /data --postgres-distrib /usr/local
pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10"
fi
echo "Staring pageserver at 0.0.0.0:6400"
pageserver -l 0.0.0.0:6400 -D /data
pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data
else
"$@"
fi

View File

@@ -10,5 +10,5 @@
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [walkeeper/README](/walkeeper/README.md) — WAL service overview.
- [walkeeper/README](/walkeeper/README) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core

View File

@@ -4,7 +4,7 @@
Currently we build two main images:
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `wal_acceptor` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
And two intermediate images used either to reduce build time or to deliver some additional binary tools from other repos:

View File

@@ -2,6 +2,16 @@
### Authentication
### Backpresssure
Backpressure is used to limit the lag between pageserver and compute node or WAL service.
If compute node or WAL service run far ahead of Page Server,
the time of serving page requests increases. This may lead to timeout errors.
To tune backpressure limits use `max_replication_write_lag`, `max_replication_flush_lag` and `max_replication_apply_lag` settings.
When lag between current LSN (pg_current_wal_flush_lsn() at compute node) and minimal write/flush/apply position of replica exceeds the limit
backends performing writes are blocked until the replica is caught up.
### Base image (page image)
### Basebackup
@@ -51,11 +61,14 @@ Each PostgreSQL fork is considered a separate relish.
### Layer
Each layer corresponds to the specific version of a relish Segment in a range of LSNs.
A layer contains data needed to reconstruct any page versions within the
layer's Segment and range of LSNs.
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable.
are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
@@ -73,7 +86,37 @@ The layer map tracks what layers exist for all the relishes in a timeline.
Zenith repository implementation that keeps data in layers.
### LSN
The Log Sequence Number (LSN) is a unique identifier of the WAL record[] in the WAL log.
The insert position is a byte offset into the logs, increasing monotonically with each new record.
Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream.
It is printed as two hexadecimal numbers of up to 8 digits each, separated by a slash.
Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html)
Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
In postgres and Zenith lsns are used to describe certain points in WAL handling.
PostgreSQL LSNs and functions to monitor them:
* `pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location.
* `pg_current_wal_lsn()` - Returns the current write-ahead log write location.
* `pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
* `pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
[source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html):
Zenith safekeeper LSNs. For more check [walkeeper/README_PROTO.md](/walkeeper/README_PROTO.md)
* `CommitLSN`: position in WAL confirmed by quorum safekeepers.
* `RestartLSN`: position in WAL confirmed by all safekeepers.
* `FlushLSN`: part of WAL persisted to the disk by safekeeper.
* `VCL`: the largerst LSN for which we can guarantee availablity of all prior records.
Zenith pageserver LSNs:
* `last_record_lsn` - the end of last processed WAL record.
* `disk_consistent_lsn` - data is known to be fully flushed and fsync'd to local disk on pageserver up to this LSN.
* `remote_consistent_lsn` - The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash.
TODO: use this name consistently in remote storage code. Now `disk_consistent_lsn` is used and meaning depends on the context.
* `ancestor_lsn` - LSN of the branch point (the LSN at which this branch was created)
TODO: add table that describes mapping between PostgreSQL (compute), safekeeper and pageserver LSNs.
### Page (block)
The basic structure used to store relation data. All pages are of the same size.

View File

@@ -56,4 +56,4 @@ Tenant id is passed to postgres via GUC the same way as the timeline. Tenant id
### Safety
For now particular tenant can only appear on a particular pageserver. Set of WAL acceptors are also pinned to particular (tenantid, timeline) pair so there can only be one writer for particular (tenantid, timeline).
For now particular tenant can only appear on a particular pageserver. Set of safekeepers are also pinned to particular (tenantid, timeline) pair so there can only be one writer for particular (tenantid, timeline).

View File

@@ -0,0 +1,22 @@
## Pageserver tenant migration
### Overview
This feature allows to migrate a timeline from one pageserver to another by utilizing remote storage capability.
### Migration process
Pageserver implements two new http handlers: timeline attach and timeline detach.
Timeline migration is performed in a following way:
1. Timeline attach is called on a target pageserver. This asks pageserver to download latest checkpoint uploaded to s3.
2. For now it is necessary to manually initialize replication stream via callmemaybe call so target pageserver initializes replication from safekeeper (it is desired to avoid this and initialize replication directly in attach handler, but this requires some refactoring (probably [#997](https://github.com/zenithdb/zenith/issues/997)/[#1049](https://github.com/zenithdb/zenith/issues/1049))
3. Replication state can be tracked via timeline detail pageserver call.
4. Compute node should be restarted with new pageserver connection string. Issue with multiple compute nodes for one timeline is handled on the safekeeper consensus level. So this is not a problem here.Currently responsibility for rescheduling the compute with updated config lies on external coordinator (console).
5. Timeline is detached from old pageserver. On disk data is removed.
### Implementation details
Now safekeeper needs to track which pageserver it is replicating to. This introduces complications into replication code:
* We need to distinguish different pageservers (now this is done by connection string which is imperfect and is covered here: https://github.com/zenithdb/zenith/issues/1105). Callmemaybe subscription management also needs to track that (this is already implemented).
* We need to track which pageserver is the primary. This is needed to avoid reconnections to non primary pageservers. Because we shouldn't reconnect to them when they decide to stop their walreceiver. I e this can appear when there is a load on the compute and we are trying to detach timeline from old pageserver. In this case callmemaybe will try to reconnect to it because replication termination condition is not met (page server with active compute could never catch up to the latest lsn, so there is always some wal tail)

180
docs/settings.md Normal file
View File

@@ -0,0 +1,180 @@
## Pageserver
Pageserver is mainly configured via a `pageserver.toml` config file.
If there's no such file during `init` phase of the server, it creates the file itself. Without 'init', the file is read.
There's a possibility to pass an arbitrary config value to the pageserver binary as an argument: such values override
the values in the config file, if any are specified for the same key and get into the final config during init phase.
### Config example
```toml
# Initial configuration file created by 'pageserver --init'
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
checkpoint_distance = '268435456' # in bytes
checkpoint_period = '1 s'
gc_period = '100 s'
gc_horizon = '67108864'
max_file_descriptors = '100'
# initial superuser role name to use when creating a new tenant
initial_superuser_name = 'zenith_admin'
# [remote_storage]
```
The config above shows default values for all basic pageserver settings.
Pageserver uses default values for all files that are missing in the config, so it's not a hard error to leave the config blank.
Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start.
Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in TOML specification and
* either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
* or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
### Config values
All values can be passed as an argument to the pageserver binary, using the `-c` parameter and specified as a valid TOML string. All tables should be passed in the inline form.
Example: `${PAGESERVER_BIN} -c "checkpoint_period = '100 s'" -c "remote_storage={local_path='/some/local/path/'}"`
Note that TOML distinguishes between strings and integers, the former require single or double quotes around them.
#### checkpoint_distance
`checkpoint_distance` is the amount of incoming WAL that is held in
the open layer, before it's flushed to local disk. It puts an upper
bound on how much WAL needs to be re-processed after a pageserver
crash. It is a soft limit, the pageserver can momentarily go above it,
but it will trigger a checkpoint operation to get it back below the
limit.
`checkpoint_distance` also determines how much WAL needs to be kept
durable in the safekeeper. The safekeeper must have capacity to hold
this much WAL, with some headroom, otherwise you can get stuck in a
situation where the safekeeper is full and stops accepting new WAL,
but the pageserver is not flushing out and releasing the space in the
safekeeper because it hasn't reached checkpoint_distance yet.
`checkpoint_distance` also controls how often the WAL is uploaded to
S3.
The unit is # of bytes.
#### checkpoint_period
The pageserver checks whether `checkpoint_distance` has been reached
every `checkpoint_period` seconds. Default is 1 s, which should be
fine.
#### gc_horizon
`gz_horizon` determines how much history is retained, to allow
branching and read replicas at an older point in time. The unit is #
of bytes of WAL. Page versions older than this are garbage collected
away.
#### gc_period
Interval at which garbage collection is triggered. Default is 100 s.
#### initial_superuser_name
Name of the initial superuser role, passed to initdb when a new tenant
is initialized. It doesn't affect anything after initialization. The
default is Note: The default is 'zenith_admin', and the console
depends on that, so if you change it, bad things will happen.
#### page_cache_size
Size of the page cache, to hold materialized page versions. Unit is
number of 8 kB blocks. The default is 8192, which means 64 MB.
#### max_file_descriptors
Max number of file descriptors to hold open concurrently for accessing
layer files. This should be kept well below the process/container/OS
limit (see `ulimit -n`), as the pageserver also needs file descriptors
for other files and for sockets for incoming connections.
#### pg_distrib_dir
A directory with Postgres installation to use during pageserver activities.
Inside that dir, a `bin/postgres` binary should be present.
The default distrib dir is `./tmp_install/`.
#### workdir (-D)
A directory in the file system, where pageserver will store its files.
The default is `./.zenith/`.
This parameter has a special CLI alias (`-D`) and can not be overridden with regular `-c` way.
##### Remote storage
There's a way to automatically back up and restore some of the pageserver's data from working dir to the remote storage.
The backup system is disabled by default and can be enabled for either of the currently available storages:
###### Local FS storage
Pageserver can back up and restore some of its workdir contents to another directory.
For that, only a path to that directory needs to be specified as a parameter:
```toml
[remote_storage]
local_path = '/some/local/path/'
```
###### S3 storage
Pageserver can back up and restore some of its workdir contents to S3.
Full set of S3 credentials is needed for that as parameters.
Configuration example:
```toml
[remote_storage]
# Name of the bucket to connect to
bucket_name = 'some-sample-bucket'
# Name of the region where the bucket is located at
bucket_region = 'eu-north-1'
# A "subfolder" in the bucket, to use the same bucket separately by multiple pageservers at once.
# Optional, pageserver uses entire bucket if the prefix is not specified.
prefix_in_bucket = '/some/prefix/'
# Access key to connect to the bucket ("login" part of the credentials)
access_key_id = 'SOMEKEYAAAAASADSAH*#'
# Secret access key to connect to the bucket ("password" part of the credentials)
secret_access_key = 'SOMEsEcReTsd292v'
```
###### General remote storage configuration
Pagesever allows only one remote storage configured concurrently and errors if parameters from multiple different remote configurations are used.
No default values are used for the remote storage configuration parameters.
Besides, there are parameters common for all types of remote storage that can be configured, those have defaults:
```toml
[remote_storage]
# Max number of concurrent connections to open for uploading to or downloading from the remote storage.
max_concurrent_sync = 100
# Max number of errors a single task can have before it's considered failed and not attempted to run anymore.
max_sync_errors = 10
```
## safekeeper
TODO

View File

@@ -79,3 +79,48 @@ Helpers for exposing Prometheus metrics from the server.
`/zenith_utils`:
Helpers that are shared between other crates in this repository.
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version) or greater.
- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected.
- If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
```
- Install `poetry`
- Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`.
- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI.
Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.
### Obligatory checks
We force code formatting via `yapf` and type hints via `mypy`.
Run the following commands in the repository's root (next to `setup.cfg`):
```bash
poetry run yapf -ri . # All code is reformatted
poetry run mypy . # Ensure there are no typing errors
```
**WARNING**: do not run `mypy` from a directory other than the root of the repository.
Otherwise it will not find its configuration.
Also consider:
* Running `flake8` (or a linter of your choice, e.g. `pycodestyle`) and fixing possible defects, if any.
* Adding more type hints to your code to avoid `Any`.
### Changing dependencies
To add new package or change an existing one you can use `poetry add` or `poetry update` or edit `pyproject.toml` manually. Do not forget to run `poetry lock` in the latter case.
More details are available in poetry's [documentation](https://python-poetry.org/docs/).

View File

@@ -1,11 +1,10 @@
[package]
name = "pageserver"
version = "0.1.0"
authors = ["Stas Kelvich <stas@zenith.tech>"]
edition = "2018"
edition = "2021"
[dependencies]
bookfile = "^0.3"
bookfile = { git = "https://github.com/zenithdb/bookfile.git", branch="generic-readext" }
chrono = "0.4.19"
rand = "0.8.3"
regex = "1.4.5"
@@ -15,14 +14,15 @@ futures = "0.3.13"
hyper = "0.14"
lazy_static = "1.4.0"
log = "0.4.14"
clap = "2.33.0"
clap = "3.0"
daemonize = "0.4.1"
tokio = { version = "1.11", features = ["process", "macros", "fs"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
routerify = "2"
anyhow = "1.0"
tokio = { version = "1.11", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-stream = "0.1.8"
anyhow = { version = "1.0", features = ["backtrace"] }
crc32c = "0.6.0"
thiserror = "1.0"
hex = { version = "0.4.3", features = ["serde"] }
@@ -30,12 +30,27 @@ tar = "0.4.33"
humantime = "2.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
toml = "0.5"
toml_edit = { version = "0.13", features = ["easy"] }
scopeguard = "1.1.0"
rust-s3 = { version = "0.27.0-rc4", features = ["no-verify-ssl"] }
async-trait = "0.1"
const_format = "0.2.21"
tracing = "0.1.27"
tracing-futures = "0.2"
signal-hook = "0.3.10"
url = "2"
nix = "0.23"
once_cell = "1.8.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] }
async-compression = {version = "0.3", features = ["zstd", "tokio"]}
postgres_ffi = { path = "../postgres_ffi" }
zenith_metrics = { path = "../zenith_metrics" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
[dev-dependencies]
hex-literal = "0.3"
tempfile = "3.2"

View File

@@ -7,8 +7,9 @@ The Page Server has a few different duties:
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Backup to S3
S3 is the main fault-tolerant storage of all data, as there are no Page Server
replicas. We use a separate fault-tolerant WAL service to reduce latency. It
keeps track of WAL records which are not synced to S3 yet.
The Page Server consists of multiple threads that operate on a shared
repository of page versions:
@@ -40,7 +41,7 @@ Legend:
+--+
....
. . Component that we will need, but doesn't exist at the moment. A TODO.
. . Component at its early development phase.
....
---> Data flow
@@ -115,13 +116,50 @@ Remove old on-disk layer files that are no longer needed according to the
PITR retention policy
TODO: Backup service
--------------------
### Backup service
The backup service is responsible for periodically pushing the chunks to S3.
The backup service, responsible for storing pageserver recovery data externally.
TODO: How/when do restore from S3? Whenever we get a GetPage@LSN request for
a chunk we don't currently have? Or when an external Control Plane tells us?
Currently, pageserver stores its files in a filesystem directory it's pointed to.
That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
Therefore, the server interacts with external, more reliable storage to back up and restore its state.
The code for storage support is extensible and can support arbitrary ones as long as they implement a certain Rust trait.
There are the following implementations present:
* local filesystem — to use in tests mainly
* AWS S3 - to use in production
Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs, parameters documentation can be found at [settings docs](../docs/settings.md).
The backup service is disabled by default and can be enabled to interact with a single remote storage.
CLI examples:
* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
* AWS S3 : `${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/',access_key_id='SOMEKEYAAAAASADSAH*#',secret_access_key='SOMEsEcReTsd292v'}"`
For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
For local S3 installations, refer to the their documentation for name format and credentials.
Similar to other pageserver settings, toml config file can be used to configure either of the storages as backup targets.
Required sections are:
```toml
[remote_storage]
local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
```
or
```toml
[remote_storage]
bucket_name = 'some-sample-bucket'
bucket_region = 'eu-north-1'
prefix_in_bucket = '/test_prefix/'
access_key_id = 'SOMEKEYAAAAASADSAH*#'
secret_access_key = 'SOMEsEcReTsd292v'
```
Also, `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` variables can be used to specify the credentials instead of any of the ways above.
TODO: Sharding
--------------------

View File

@@ -10,9 +10,10 @@
//! This module is responsible for creation of such tarball
//! from data stored in object storage.
//!
use anyhow::Result;
use anyhow::{Context, Result};
use bytes::{BufMut, BytesMut};
use log::*;
use std::fmt::Write as FmtWrite;
use std::io;
use std::io::Write;
use std::sync::Arc;
@@ -31,7 +32,7 @@ use zenith_utils::lsn::Lsn;
pub struct Basebackup<'a> {
ar: Builder<&'a mut dyn Write>,
timeline: &'a Arc<dyn Timeline>,
lsn: Lsn,
pub lsn: Lsn,
prev_record_lsn: Lsn,
}
@@ -83,7 +84,7 @@ impl<'a> Basebackup<'a> {
info!(
"taking basebackup lsn={}, prev_lsn={}",
backup_prev, backup_lsn
backup_lsn, backup_prev
);
Ok(Basebackup {
@@ -97,7 +98,6 @@ impl<'a> Basebackup<'a> {
pub fn send_tarball(&mut self) -> anyhow::Result<()> {
// Create pgdata subdirs structure
for dir in pg_constants::PGDATA_SUBDIRS.iter() {
info!("send subdir {:?}", *dir);
let header = new_tar_header_dir(*dir)?;
self.ar.append(&header, &mut io::empty())?;
}
@@ -242,20 +242,16 @@ impl<'a> Basebackup<'a> {
fn add_pgcontrol_file(&mut self) -> anyhow::Result<()> {
let checkpoint_bytes = self
.timeline
.get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn)?;
let pg_control_bytes =
self.timeline
.get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn)?;
.get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn)
.context("failed to get checkpoint bytes")?;
let pg_control_bytes = self
.timeline
.get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn)
.context("failed get control bytes")?;
let mut pg_control = ControlFileData::decode(&pg_control_bytes)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
// Generate new pg_control and WAL needed for bootstrap
let checkpoint_segno = self.lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let checkpoint_lsn = XLogSegNoOffsetToRecPtr(
checkpoint_segno,
XLOG_SIZE_OF_XLOG_LONG_PHD as u32,
pg_constants::WAL_SEGMENT_SIZE,
);
// Generate new pg_control needed for bootstrap
checkpoint.redo = normalize_lsn(self.lsn, pg_constants::WAL_SEGMENT_SIZE).0;
//reset some fields we don't want to preserve
@@ -264,19 +260,24 @@ impl<'a> Basebackup<'a> {
checkpoint.oldestActiveXid = 0;
//save new values in pg_control
pg_control.checkPoint = checkpoint_lsn;
pg_control.checkPoint = 0;
pg_control.checkPointCopy = checkpoint;
pg_control.state = pg_constants::DB_SHUTDOWNED;
// add zenith.signal file
let xl_prev = if self.prev_record_lsn == Lsn(0) {
0xBAD0 // magic value to indicate that we don't know prev_lsn
let mut zenith_signal = String::new();
if self.prev_record_lsn == Lsn(0) {
if self.lsn == self.timeline.get_ancestor_lsn() {
write!(zenith_signal, "PREV LSN: none")?;
} else {
write!(zenith_signal, "PREV LSN: invalid")?;
}
} else {
self.prev_record_lsn.0
};
write!(zenith_signal, "PREV LSN: {}", self.prev_record_lsn)?;
}
self.ar.append(
&new_tar_header("zenith.signal", 8)?,
&xl_prev.to_le_bytes()[..],
&new_tar_header("zenith.signal", zenith_signal.len() as u64)?,
zenith_signal.as_bytes(),
)?;
//send pg_control
@@ -285,14 +286,11 @@ impl<'a> Basebackup<'a> {
self.ar.append(&header, &pg_control_bytes[..])?;
//send wal segment
let wal_file_name = XLogFileName(
1, // FIXME: always use Postgres timeline 1
checkpoint_segno,
pg_constants::WAL_SEGMENT_SIZE,
);
let segno = self.lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let wal_file_name = XLogFileName(PG_TLI, segno, pg_constants::WAL_SEGMENT_SIZE);
let wal_file_path = format!("pg_wal/{}", wal_file_name);
let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?;
let wal_seg = generate_wal_segment(&pg_control);
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier);
assert!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
self.ar.append(&header, &wal_seg[..])?;
Ok(())

View File

@@ -4,13 +4,16 @@
use anyhow::Result;
use clap::{App, Arg};
use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::virtual_file;
use std::path::PathBuf;
use zenith_utils::GIT_VERSION;
fn main() -> Result<()> {
let arg_matches = App::new("Zenith dump_layerfile utility")
.about("Dump contents of one layer file, for debugging")
.version(GIT_VERSION)
.arg(
Arg::with_name("path")
Arg::new("path")
.help("Path to file to dump")
.required(true)
.index(1),
@@ -19,6 +22,9 @@ fn main() -> Result<()> {
let path = PathBuf::from(arg_matches.value_of("path").unwrap());
// Basic initialization of things that don't change after startup
virtual_file::init(10);
dump_layerfile_from_path(&path)?;
Ok(())

View File

@@ -1,403 +1,167 @@
//
// Main entry point for the Page Server executable
//
//! Main entry point for the Page Server executable.
use log::*;
use pageserver::defaults::*;
use serde::{Deserialize, Serialize};
use std::{
env,
net::TcpListener,
path::{Path, PathBuf},
process::exit,
str::FromStr,
thread,
};
use zenith_utils::{auth::JwtAuth, logging, postgres_backend::AuthType};
use std::{env, path::Path, str::FromStr};
use tracing::*;
use zenith_utils::{auth::JwtAuth, logging, postgres_backend::AuthType, tcp_listener, GIT_VERSION};
use anyhow::{bail, ensure, Result};
use clap::{App, Arg, ArgMatches};
use anyhow::{bail, Context, Result};
use clap::{App, Arg};
use daemonize::Daemonize;
use pageserver::{
branches, http, page_service, tenant_mgr, PageServerConf, RelishStorageConfig, S3Config,
LOG_FILE_NAME,
branches,
config::{defaults::*, PageServerConf},
http, page_cache, page_service, remote_storage, tenant_mgr, thread_mgr,
thread_mgr::ThreadKind,
virtual_file, LOG_FILE_NAME,
};
use zenith_utils::http::endpoint;
/// String arguments that can be declared via CLI or config file
#[derive(Serialize, Deserialize)]
struct CfgFileParams {
listen_pg_addr: Option<String>,
listen_http_addr: Option<String>,
checkpoint_distance: Option<String>,
checkpoint_period: Option<String>,
gc_horizon: Option<String>,
gc_period: Option<String>,
pg_distrib_dir: Option<String>,
auth_validation_public_key_path: Option<String>,
auth_type: Option<String>,
// see https://github.com/alexcrichton/toml-rs/blob/6c162e6562c3e432bf04c82a3d1d789d80761a86/examples/enum_external.rs for enum deserialisation examples
relish_storage: Option<RelishStorage>,
}
#[derive(Serialize, Deserialize, Clone)]
enum RelishStorage {
Local {
local_path: String,
},
AwsS3 {
bucket_name: String,
bucket_region: String,
#[serde(skip_serializing)]
access_key_id: Option<String>,
#[serde(skip_serializing)]
secret_access_key: Option<String>,
},
}
impl CfgFileParams {
/// Extract string arguments from CLI
fn from_args(arg_matches: &ArgMatches) -> Self {
let get_arg = |arg_name: &str| -> Option<String> {
arg_matches.value_of(arg_name).map(str::to_owned)
};
let relish_storage = if let Some(local_path) = get_arg("relish-storage-local-path") {
Some(RelishStorage::Local { local_path })
} else if let Some((bucket_name, bucket_region)) =
get_arg("relish-storage-s3-bucket").zip(get_arg("relish-storage-region"))
{
Some(RelishStorage::AwsS3 {
bucket_name,
bucket_region,
access_key_id: get_arg("relish-storage-access-key"),
secret_access_key: get_arg("relish-storage-secret-access-key"),
})
} else {
None
};
Self {
listen_pg_addr: get_arg("listen-pg"),
listen_http_addr: get_arg("listen-http"),
checkpoint_distance: get_arg("checkpoint_distance"),
checkpoint_period: get_arg("checkpoint_period"),
gc_horizon: get_arg("gc_horizon"),
gc_period: get_arg("gc_period"),
pg_distrib_dir: get_arg("postgres-distrib"),
auth_validation_public_key_path: get_arg("auth-validation-public-key-path"),
auth_type: get_arg("auth-type"),
relish_storage,
}
}
/// Fill missing values in `self` with `other`
fn or(self, other: CfgFileParams) -> Self {
// TODO cleaner way to do this
Self {
listen_pg_addr: self.listen_pg_addr.or(other.listen_pg_addr),
listen_http_addr: self.listen_http_addr.or(other.listen_http_addr),
checkpoint_distance: self.checkpoint_distance.or(other.checkpoint_distance),
checkpoint_period: self.checkpoint_period.or(other.checkpoint_period),
gc_horizon: self.gc_horizon.or(other.gc_horizon),
gc_period: self.gc_period.or(other.gc_period),
pg_distrib_dir: self.pg_distrib_dir.or(other.pg_distrib_dir),
auth_validation_public_key_path: self
.auth_validation_public_key_path
.or(other.auth_validation_public_key_path),
auth_type: self.auth_type.or(other.auth_type),
relish_storage: self.relish_storage.or(other.relish_storage),
}
}
/// Create a PageServerConf from these string parameters
fn try_into_config(&self) -> Result<PageServerConf> {
let workdir = PathBuf::from(".");
let listen_pg_addr = match self.listen_pg_addr.as_ref() {
Some(addr) => addr.clone(),
None => DEFAULT_PG_LISTEN_ADDR.to_owned(),
};
let listen_http_addr = match self.listen_http_addr.as_ref() {
Some(addr) => addr.clone(),
None => DEFAULT_HTTP_LISTEN_ADDR.to_owned(),
};
let checkpoint_distance: u64 = match self.checkpoint_distance.as_ref() {
Some(checkpoint_distance_str) => checkpoint_distance_str.parse()?,
None => DEFAULT_CHECKPOINT_DISTANCE,
};
let checkpoint_period = match self.checkpoint_period.as_ref() {
Some(checkpoint_period_str) => humantime::parse_duration(checkpoint_period_str)?,
None => DEFAULT_CHECKPOINT_PERIOD,
};
let gc_horizon: u64 = match self.gc_horizon.as_ref() {
Some(horizon_str) => horizon_str.parse()?,
None => DEFAULT_GC_HORIZON,
};
let gc_period = match self.gc_period.as_ref() {
Some(period_str) => humantime::parse_duration(period_str)?,
None => DEFAULT_GC_PERIOD,
};
let pg_distrib_dir = match self.pg_distrib_dir.as_ref() {
Some(pg_distrib_dir_str) => PathBuf::from(pg_distrib_dir_str),
None => env::current_dir()?.join("tmp_install"),
};
let auth_validation_public_key_path = self
.auth_validation_public_key_path
.as_ref()
.map(PathBuf::from);
let auth_type = self
.auth_type
.as_ref()
.map_or(Ok(AuthType::Trust), |auth_type| {
AuthType::from_str(auth_type)
})?;
if !pg_distrib_dir.join("bin/postgres").exists() {
bail!("Can't find postgres binary at {:?}", pg_distrib_dir);
}
if auth_type == AuthType::ZenithJWT {
ensure!(
auth_validation_public_key_path.is_some(),
"Missing auth_validation_public_key_path when auth_type is ZenithJWT"
);
let path_ref = auth_validation_public_key_path.as_ref().unwrap();
ensure!(
path_ref.exists(),
format!("Can't find auth_validation_public_key at {:?}", path_ref)
);
}
let relish_storage_config =
self.relish_storage
.as_ref()
.map(|storage_params| match storage_params.clone() {
RelishStorage::Local { local_path } => {
RelishStorageConfig::LocalFs(PathBuf::from(local_path))
}
RelishStorage::AwsS3 {
bucket_name,
bucket_region,
access_key_id,
secret_access_key,
} => RelishStorageConfig::AwsS3(S3Config {
bucket_name,
bucket_region,
access_key_id,
secret_access_key,
}),
});
Ok(PageServerConf {
daemonize: false,
listen_pg_addr,
listen_http_addr,
checkpoint_distance,
checkpoint_period,
gc_horizon,
gc_period,
superuser: String::from(DEFAULT_SUPERUSER),
workdir,
pg_distrib_dir,
auth_validation_public_key_path,
auth_type,
relish_storage_config,
})
}
}
use zenith_utils::postgres_backend;
use zenith_utils::shutdown::exit_now;
use zenith_utils::signals::{self, Signal};
fn main() -> Result<()> {
zenith_metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(GIT_VERSION)
.arg(
Arg::with_name("listen-pg")
.short("l")
.long("listen-pg")
.alias("listen") // keep some compatibility
.takes_value(true)
.help("listen for incoming page requests on ip:port (default: 127.0.0.1:5430)"),
)
.arg(
Arg::with_name("listen-http")
.long("listen-http")
.alias("http_endpoint") // keep some compatibility
.takes_value(true)
.help("http endpoint address for for metrics and management API calls ip:port (default: 127.0.0.1:5430)"),
)
.arg(
Arg::with_name("daemonize")
.short("d")
Arg::new("daemonize")
.short('d')
.long("daemonize")
.takes_value(false)
.help("Run in the background"),
)
.arg(
Arg::with_name("init")
Arg::new("init")
.long("init")
.takes_value(false)
.help("Initialize pageserver repo"),
)
.arg(
Arg::with_name("checkpoint_distance")
.long("checkpoint_distance")
.takes_value(true)
.help("Distance from current LSN to perform checkpoint of in-memory layers"),
)
.arg(
Arg::with_name("checkpoint_period")
.long("checkpoint_period")
.takes_value(true)
.help("Interval between checkpoint iterations"),
)
.arg(
Arg::with_name("gc_horizon")
.long("gc_horizon")
.takes_value(true)
.help("Distance from current LSN to perform all wal records cleanup"),
)
.arg(
Arg::with_name("gc_period")
.long("gc_period")
.takes_value(true)
.help("Interval between garbage collector iterations"),
)
.arg(
Arg::with_name("workdir")
.short("D")
Arg::new("workdir")
.short('D')
.long("workdir")
.takes_value(true)
.help("Working directory for the pageserver"),
)
.arg(
Arg::with_name("postgres-distrib")
.long("postgres-distrib")
.takes_value(true)
.help("Postgres distribution directory"),
)
.arg(
Arg::with_name("create-tenant")
Arg::new("create-tenant")
.long("create-tenant")
.takes_value(true)
.help("Create tenant during init")
.requires("init"),
)
// See `settings.md` for more details on the extra configuration patameters pageserver can process
.arg(
Arg::with_name("auth-validation-public-key-path")
.long("auth-validation-public-key-path")
Arg::new("config-override")
.short('c')
.takes_value(true)
.help("Path to public key used to validate jwt signature"),
)
.arg(
Arg::with_name("auth-type")
.long("auth-type")
.takes_value(true)
.help("Authentication scheme type. One of: Trust, MD5, ZenithJWT"),
)
.arg(
Arg::with_name("relish-storage-local-path")
.long("relish-storage-local-path")
.takes_value(true)
.help("Path to the local directory, to be used as an external relish storage")
.conflicts_with_all(&[
"relish-storage-s3-bucket",
"relish-storage-region",
"relish-storage-access-key",
"relish-storage-secret-access-key",
]),
)
.arg(
Arg::with_name("relish-storage-s3-bucket")
.long("relish-storage-s3-bucket")
.takes_value(true)
.help("Name of the AWS S3 bucket to use an external relish storage")
.requires("relish-storage-region"),
)
.arg(
Arg::with_name("relish-storage-region")
.long("relish-storage-region")
.takes_value(true)
.help("Region of the AWS S3 bucket"),
)
.arg(
Arg::with_name("relish-storage-access-key")
.long("relish-storage-access-key")
.takes_value(true)
.help("Credentials to access the AWS S3 bucket"),
)
.arg(
Arg::with_name("relish-storage-secret-access-key")
.long("relish-storage-secret-access-key")
.takes_value(true)
.help("Credentials to access the AWS S3 bucket"),
.number_of_values(1)
.multiple_occurrences(true)
.help("Additional configuration overrides of the ones from the toml config file (or new ones to add there).
Any option has to be a valid toml document, example: `-c=\"foo='hey'\"` `-c=\"foo={value=1}\"`"),
)
.get_matches();
let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".zenith"));
let cfg_file_path = workdir.canonicalize()?.join("pageserver.toml");
let args_params = CfgFileParams::from_args(&arg_matches);
let workdir = workdir
.canonicalize()
.with_context(|| format!("Error opening workdir '{}'", workdir.display()))?;
let cfg_file_path = workdir.join("pageserver.toml");
let init = arg_matches.is_present("init");
let create_tenant = arg_matches.value_of("create-tenant");
let params = if init {
// Set CWD to workdir for non-daemon modes
env::set_current_dir(&workdir).with_context(|| {
format!(
"Failed to set application's current dir to '{}'",
workdir.display()
)
})?;
let daemonize = arg_matches.is_present("daemonize");
if init && daemonize {
bail!("--daemonize cannot be used with --init")
}
let mut toml = if init {
// We're initializing the repo, so there's no config file yet
args_params
DEFAULT_CONFIG_FILE
.parse::<toml_edit::Document>()
.expect("could not parse built-in config file")
} else {
// Supplement the CLI arguments with the config file
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)?;
let file_params: CfgFileParams = toml::from_str(&cfg_file_contents)?;
args_params.or(file_params)
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)
.with_context(|| format!("No pageserver config at '{}'", cfg_file_path.display()))?;
cfg_file_contents
.parse::<toml_edit::Document>()
.with_context(|| {
format!(
"Failed to read '{}' as pageserver config",
cfg_file_path.display()
)
})?
};
// Set CWD to workdir for non-daemon modes
env::set_current_dir(&workdir)?;
// Process any extra options given with -c
if let Some(values) = arg_matches.values_of("config-override") {
for option_line in values {
let doc = toml_edit::Document::from_str(option_line).with_context(|| {
format!(
"Option '{}' could not be parsed as a toml document",
option_line
)
})?;
// Ensure the config is valid, even if just init-ing
let mut conf = params.try_into_config()?;
conf.daemonize = arg_matches.is_present("daemonize");
if init && conf.daemonize {
eprintln!("--daemonize cannot be used with --init");
exit(1);
for (key, item) in doc.iter() {
if key == "id" {
anyhow::ensure!(
init,
"node id can only be set during pageserver init and cannot be overridden"
);
}
toml.insert(key, item.clone());
}
}
}
trace!("Resulting toml: {}", toml);
let conf = PageServerConf::parse_and_validate(&toml, &workdir)
.context("Failed to parse pageserver configuration")?;
// The configuration is all set up now. Turn it into a 'static
// that can be freely stored in structs and passed across threads
// as a ref.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
page_cache::init(conf);
// Create repo and exit if init was requested
if init {
branches::init_pageserver(conf, create_tenant)?;
branches::init_pageserver(conf, create_tenant).context("Failed to init pageserver")?;
// write the config file
let cfg_file_contents = toml::to_string_pretty(&params)?;
// TODO support enable-auth flag
std::fs::write(&cfg_file_path, cfg_file_contents)?;
return Ok(());
std::fs::write(&cfg_file_path, toml.to_string()).with_context(|| {
format!(
"Failed to initialize pageserver config at '{}'",
cfg_file_path.display()
)
})?;
Ok(())
} else {
start_pageserver(conf, daemonize).context("Failed to start pageserver")
}
start_pageserver(conf)
}
fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()> {
// Initialize logger
let (_scope_guard, log_file) = logging::init(LOG_FILE_NAME, conf.daemonize)?;
let log_file = logging::init(LOG_FILE_NAME, daemonize)?;
info!("version: {}", GIT_VERSION);
// TODO: Check that it looks like a valid repository before going further
@@ -406,15 +170,16 @@ fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
"Starting pageserver http handler on {}",
conf.listen_http_addr
);
let http_listener = TcpListener::bind(conf.listen_http_addr.clone())?;
let http_listener = tcp_listener::bind(conf.listen_http_addr.clone())?;
info!(
"Starting pageserver pg protocol handler on {}",
conf.listen_pg_addr
);
let pageserver_listener = TcpListener::bind(conf.listen_pg_addr.clone())?;
let pageserver_listener = tcp_listener::bind(conf.listen_pg_addr.clone())?;
if conf.daemonize {
// NB: Don't spawn any threads before daemonizing!
if daemonize {
info!("daemonizing...");
// There shouldn't be any logging to stdin/stdout. Redirect it to the main log so
@@ -428,17 +193,22 @@ fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
.stdout(stdout)
.stderr(stderr);
match daemonize.start() {
// XXX: The parent process should exit abruptly right after
// it has spawned a child to prevent coverage machinery from
// dumping stats into a `profraw` file now owned by the child.
// Otherwise, the coverage data will be damaged.
match daemonize.exit_action(|| exit_now(0)).start() {
Ok(_) => info!("Success, daemonized"),
Err(e) => error!("Error, {}", e),
Err(err) => error!(%err, "could not daemonize"),
}
}
// Initialize tenant manager.
tenant_mgr::init(conf);
let signals = signals::install_shutdown_handlers()?;
let sync_startup = remote_storage::start_local_timeline_sync(conf)
.context("Failed to set up local files sync with external storage")?;
// keep join handles for spawned threads
let mut join_handles = vec![];
// Initialize tenant manager.
tenant_mgr::set_timeline_states(conf, sync_startup.initial_timeline_states);
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
@@ -453,31 +223,74 @@ fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
// Spawn a new thread for the http endpoint
// bind before launching separate thread so the error reported before startup exits
let cloned = auth.clone();
let http_endpoint_thread = thread::Builder::new()
.name("http_endpoint_thread".into())
.spawn(move || {
let router = http::make_router(conf, cloned);
endpoint::serve_thread_main(router, http_listener)
})?;
let auth_cloned = auth.clone();
thread_mgr::spawn(
ThreadKind::HttpEndpointListener,
None,
None,
"http_endpoint_thread",
move || {
let router = http::make_router(conf, auth_cloned);
endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher())
},
)?;
join_handles.push(http_endpoint_thread);
// Spawn a thread to listen for connections. It will spawn further threads
// Spawn a thread to listen for libpq connections. It will spawn further threads
// for each connection.
let page_service_thread = thread::Builder::new()
.name("Page Service thread".into())
.spawn(move || {
page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type)
})?;
thread_mgr::spawn(
ThreadKind::LibpqEndpointListener,
None,
None,
"libpq endpoint thread",
move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type),
)?;
join_handles.push(page_service_thread);
signals.handle(|signal| match signal {
Signal::Quit => {
info!(
"Got {}. Terminating in immediate shutdown mode",
signal.name()
);
std::process::exit(111);
}
for handle in join_handles.into_iter() {
handle
.join()
.expect("thread panicked")
.expect("thread exited with an error")
}
Ok(())
Signal::Interrupt | Signal::Terminate => {
info!(
"Got {}. Terminating gracefully in fast shutdown mode",
signal.name()
);
shutdown_pageserver();
unreachable!()
}
})
}
fn shutdown_pageserver() {
// Shut down the libpq endpoint thread. This prevents new connections from
// being accepted.
thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None);
// Shut down any page service threads.
postgres_backend::set_pgbackend_shutdown_requested();
thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None);
// Shut down all the tenants. This flushes everything to disk and kills
// the checkpoint and GC threads.
tenant_mgr::shutdown_all_tenants();
// Stop syncing with remote storage.
//
// FIXME: Does this wait for the sync thread to finish syncing what's queued up?
// Should it?
thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None);
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None);
// There should be nothing left, but let's be sure
thread_mgr::shutdown_threads(None, None, None);
info!("Shut down successfully completed");
std::process::exit(0);
}

View File

@@ -0,0 +1,334 @@
//! A CLI helper to deal with remote storage (S3, usually) blobs as archives.
//! See [`compression`] for more details about the archives.
use std::{collections::BTreeSet, path::Path};
use anyhow::{bail, ensure, Context};
use clap::{App, Arg};
use pageserver::{
layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME},
remote_storage::compression,
};
use tokio::{fs, io};
use zenith_utils::GIT_VERSION;
const LIST_SUBCOMMAND: &str = "list";
const ARCHIVE_ARG_NAME: &str = "archive";
const EXTRACT_SUBCOMMAND: &str = "extract";
const TARGET_DIRECTORY_ARG_NAME: &str = "target_directory";
const CREATE_SUBCOMMAND: &str = "create";
const SOURCE_DIRECTORY_ARG_NAME: &str = "source_directory";
#[tokio::main(flavor = "current_thread")]
async fn main() -> anyhow::Result<()> {
let arg_matches = App::new("pageserver zst blob [un]compressor utility")
.version(GIT_VERSION)
.subcommands(vec![
App::new(LIST_SUBCOMMAND)
.about("List the archive contents")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to list the contents of"),
),
App::new(EXTRACT_SUBCOMMAND)
.about("Extracts the archive into the directory")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to extract"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to extract the archive into. Optional, will use the current directory if not specified"),
),
App::new(CREATE_SUBCOMMAND)
.about("Creates an archive with the contents of a directory (only the first level files are taken, metadata file has to be present in the same directory)")
.arg(
Arg::new(SOURCE_DIRECTORY_ARG_NAME)
.required(true)
.takes_value(true)
.help("A directory to use for creating the archive"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to create the archive in. Optional, will use the current directory if not specified"),
),
])
.get_matches();
let subcommand_name = match arg_matches.subcommand_name() {
Some(name) => name,
None => bail!("No subcommand specified"),
};
let subcommand_matches = match arg_matches.subcommand_matches(subcommand_name) {
Some(matches) => matches,
None => bail!(
"No subcommand arguments were recognized for subcommand '{}'",
subcommand_name
),
};
let target_dir = Path::new(
subcommand_matches
.value_of(TARGET_DIRECTORY_ARG_NAME)
.unwrap_or("./"),
);
match subcommand_name {
LIST_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
list_archive(archive).await
}
EXTRACT_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
extract_archive(archive, target_dir).await
}
CREATE_SUBCOMMAND => {
let source_dir = match subcommand_matches.value_of(SOURCE_DIRECTORY_ARG_NAME) {
Some(source) => Path::new(source),
None => bail!("No '{}' argument is specified", SOURCE_DIRECTORY_ARG_NAME),
};
create_archive(source_dir, target_dir).await
}
unknown => bail!("Unknown subcommand {}", unknown),
}
}
async fn list_archive(archive: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
println!("Listing an archive at path '{}'", archive.display());
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
let archive_bytes = fs::read(&archive)
.await
.context("Failed to read the archive bytes")?;
let header = compression::read_archive_header(archive_name, &mut archive_bytes.as_slice())
.await
.context("Failed to read the archive header")?;
let empty_path = Path::new("");
println!("-------------------------------");
let longest_path_in_archive = header
.files
.iter()
.filter_map(|file| Some(file.subpath.as_path(empty_path).to_str()?.len()))
.max()
.unwrap_or_default()
.max(METADATA_FILE_NAME.len());
for regular_file in &header.files {
println!(
"File: {:width$} uncompressed size: {} bytes",
regular_file.subpath.as_path(empty_path).display(),
regular_file.size,
width = longest_path_in_archive,
)
}
println!(
"File: {:width$} uncompressed size: {} bytes",
METADATA_FILE_NAME,
header.metadata_file_size,
width = longest_path_in_archive,
);
println!("-------------------------------");
Ok(())
}
async fn extract_archive(archive: &Path, target_dir: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
let mut dir_contents = fs::read_dir(&target_dir)
.await
.context("Failed to list the target directory contents")?;
let dir_entry = dir_contents
.next_entry()
.await
.context("Failed to list the target directory contents")?;
ensure!(
dir_entry.is_none(),
"Target directory '{}' is not empty",
target_dir.display()
);
println!(
"Extracting an archive at path '{}' into directory '{}'",
archive.display(),
target_dir.display()
);
let mut archive_file = fs::File::open(&archive).await.with_context(|| {
format!(
"Failed to get the archive name from the path '{}'",
archive.display()
)
})?;
let header = compression::read_archive_header(archive_name, &mut archive_file)
.await
.context("Failed to read the archive header")?;
compression::uncompress_with_header(&BTreeSet::new(), &target_dir, header, &mut archive_file)
.await
.context("Failed to extract the archive")
}
async fn create_archive(source_dir: &Path, target_dir: &Path) -> anyhow::Result<()> {
let source_dir = source_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the source dir path '{}'",
source_dir.display()
)
})?;
ensure!(
source_dir.is_dir(),
"Path '{}' is not a directory",
source_dir.display()
);
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
println!(
"Compressing directory '{}' and creating resulting archive in directory '{}'",
source_dir.display(),
target_dir.display()
);
let mut metadata_file_contents = None;
let mut files_co_archive = Vec::new();
let mut source_dir_contents = fs::read_dir(&source_dir)
.await
.context("Failed to read the source directory contents")?;
while let Some(source_dir_entry) = source_dir_contents
.next_entry()
.await
.context("Failed to read a source dir entry")?
{
let entry_path = source_dir_entry.path();
if entry_path.is_file() {
if entry_path.file_name().and_then(|name| name.to_str()) == Some(METADATA_FILE_NAME) {
let metadata_bytes = fs::read(entry_path)
.await
.context("Failed to read metata file bytes in the source dir")?;
metadata_file_contents = Some(
TimelineMetadata::from_bytes(&metadata_bytes)
.context("Failed to parse metata file contents in the source dir")?,
);
} else {
files_co_archive.push(entry_path);
}
}
}
let metadata = match metadata_file_contents {
Some(metadata) => metadata,
None => bail!(
"No metadata file found in the source dir '{}', cannot create the archive",
source_dir.display()
),
};
let _ = compression::archive_files_as_stream(
&source_dir,
files_co_archive.iter(),
&metadata,
move |mut archive_streamer, archive_name| async move {
let archive_target = target_dir.join(&archive_name);
let mut archive_file = fs::File::create(&archive_target).await?;
io::copy(&mut archive_streamer, &mut archive_file).await?;
Ok(archive_target)
},
)
.await
.context("Failed to create an archive")?;
Ok(())
}

View File

@@ -0,0 +1,72 @@
//! Main entry point for the edit_metadata executable
//!
//! A handy tool for debugging, that's all.
use anyhow::Result;
use clap::{App, Arg};
use pageserver::layered_repository::metadata::TimelineMetadata;
use std::path::PathBuf;
use std::str::FromStr;
use zenith_utils::lsn::Lsn;
use zenith_utils::GIT_VERSION;
fn main() -> Result<()> {
let arg_matches = App::new("Zenith update metadata utility")
.about("Dump or update metadata file")
.version(GIT_VERSION)
.arg(
Arg::new("path")
.help("Path to metadata file")
.required(true),
)
.arg(
Arg::new("disk_lsn")
.short('d')
.long("disk_lsn")
.takes_value(true)
.help("Replace disk constistent lsn"),
)
.arg(
Arg::new("prev_lsn")
.short('p')
.long("prev_lsn")
.takes_value(true)
.help("Previous record LSN"),
)
.get_matches();
let path = PathBuf::from(arg_matches.value_of("path").unwrap());
let metadata_bytes = std::fs::read(&path)?;
let mut meta = TimelineMetadata::from_bytes(&metadata_bytes)?;
println!("Current metadata:\n{:?}", &meta);
let mut update_meta = false;
if let Some(disk_lsn) = arg_matches.value_of("disk_lsn") {
meta = TimelineMetadata::new(
Lsn::from_str(disk_lsn)?,
meta.prev_record_lsn(),
meta.ancestor_timeline(),
meta.ancestor_lsn(),
meta.latest_gc_cutoff_lsn(),
meta.initdb_lsn(),
);
update_meta = true;
}
if let Some(prev_lsn) = arg_matches.value_of("prev_lsn") {
meta = TimelineMetadata::new(
meta.disk_consistent_lsn(),
Some(Lsn::from_str(prev_lsn)?),
meta.ancestor_timeline(),
meta.ancestor_lsn(),
meta.latest_gc_cutoff_lsn(),
meta.initdb_lsn(),
);
update_meta = true;
}
if update_meta {
let metadata_bytes = meta.to_bytes()?;
std::fs::write(&path, &metadata_bytes)?;
}
Ok(())
}

View File

@@ -4,7 +4,7 @@
// TODO: move all paths construction to conf impl
//
use anyhow::{bail, ensure, Context, Result};
use anyhow::{bail, Context, Result};
use postgres_ffi::ControlFileData;
use serde::{Deserialize, Serialize};
use std::{
@@ -14,16 +14,18 @@ use std::{
str::FromStr,
sync::Arc,
};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use tracing::*;
use log::*;
use zenith_utils::crashsafe_dir;
use zenith_utils::logging;
use zenith_utils::lsn::Lsn;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::tenant_mgr;
use crate::walredo::WalRedoManager;
use crate::{repository::Repository, PageServerConf};
use crate::{restore_local_repo, LOG_FILE_NAME};
use crate::CheckpointConfig;
use crate::{config::PageServerConf, repository::Repository};
use crate::{import_datadir, LOG_FILE_NAME};
use crate::{repository::RepositoryTimeline, tenant_mgr};
#[derive(Serialize, Deserialize, Clone)]
pub struct BranchInfo {
@@ -34,48 +36,49 @@ pub struct BranchInfo {
pub ancestor_id: Option<String>,
pub ancestor_lsn: Option<String>,
pub current_logical_size: usize,
pub current_logical_size_non_incremental: usize,
pub current_logical_size_non_incremental: Option<usize>,
}
impl BranchInfo {
pub fn from_path<T: AsRef<Path>>(
path: T,
conf: &PageServerConf,
tenantid: &ZTenantId,
repo: &Arc<dyn Repository>,
include_non_incremental_logical_size: bool,
) -> Result<Self> {
let name = path
.as_ref()
.file_name()
.unwrap()
.to_str()
.unwrap()
.to_string();
let timeline_id = std::fs::read_to_string(path)?.parse::<ZTimelineId>()?;
let path = path.as_ref();
let name = path.file_name().unwrap().to_string_lossy().to_string();
let timeline_id = std::fs::read_to_string(path)
.with_context(|| {
format!(
"Failed to read branch file contents at path '{}'",
path.display()
)
})?
.parse::<ZTimelineId>()?;
let timeline = repo.get_timeline(timeline_id)?;
let timeline = match repo.get_timeline(timeline_id)? {
RepositoryTimeline::Local(local_entry) => local_entry,
RepositoryTimeline::Remote { .. } => {
bail!("Timeline {} is remote, no branches to display", timeline_id)
}
};
let ancestor_path = conf.ancestor_path(&timeline_id, tenantid);
let mut ancestor_id: Option<String> = None;
let mut ancestor_lsn: Option<String> = None;
// we use ancestor lsn zero if we don't have an ancestor, so turn this into an option based on timeline id
let (ancestor_id, ancestor_lsn) = match timeline.get_ancestor_timeline_id() {
Some(ancestor_id) => (
Some(ancestor_id.to_string()),
Some(timeline.get_ancestor_lsn().to_string()),
),
None => (None, None),
};
if ancestor_path.exists() {
let ancestor = std::fs::read_to_string(ancestor_path)?;
let mut strings = ancestor.split('@');
ancestor_id = Some(
strings
.next()
.with_context(|| "wrong branch ancestor point in time format")?
.to_owned(),
);
ancestor_lsn = Some(
strings
.next()
.with_context(|| "wrong branch ancestor point in time format")?
.to_owned(),
);
}
// non incremental size calculation can be heavy, so let it be optional
// needed for tests to check size calculation
let current_logical_size_non_incremental = include_non_incremental_logical_size
.then(|| {
timeline.get_current_logical_size_non_incremental(timeline.get_last_record_lsn())
})
.transpose()?;
Ok(BranchInfo {
name,
@@ -84,8 +87,7 @@ impl BranchInfo {
ancestor_id,
ancestor_lsn,
current_logical_size: timeline.get_current_logical_size(),
current_logical_size_non_incremental: timeline
.get_current_logical_size_non_incremental(timeline.get_last_record_lsn())?,
current_logical_size_non_incremental,
})
}
}
@@ -99,7 +101,7 @@ pub struct PointInTime {
pub fn init_pageserver(conf: &'static PageServerConf, create_tenant: Option<&str>) -> Result<()> {
// Initialize logger
// use true as daemonize parameter because otherwise we pollute zenith cli output with a few pages long output of info messages
let (_scope_guard, _log_file) = logging::init(LOG_FILE_NAME, true)?;
let _log_file = logging::init(LOG_FILE_NAME, true)?;
// We don't use the real WAL redo manager, because we don't want to spawn the WAL redo
// process during repository initialization.
@@ -116,9 +118,9 @@ pub fn init_pageserver(conf: &'static PageServerConf, create_tenant: Option<&str
if let Some(tenantid) = create_tenant {
let tenantid = ZTenantId::from_str(tenantid)?;
println!("initializing tenantid {}", tenantid);
create_repo(conf, tenantid, dummy_redo_mgr).with_context(|| "failed to create repo")?;
create_repo(conf, tenantid, dummy_redo_mgr).context("failed to create repo")?;
}
fs::create_dir_all(conf.tenants_path())?;
crashsafe_dir::create_dir_all(conf.tenants_path())?;
println!("pageserver init succeeded");
Ok(())
@@ -135,27 +137,32 @@ pub fn create_repo(
}
// top-level dir may exist if we are creating it through CLI
fs::create_dir_all(&repo_dir)
crashsafe_dir::create_dir_all(&repo_dir)
.with_context(|| format!("could not create directory {}", repo_dir.display()))?;
fs::create_dir(conf.timelines_path(&tenantid))?;
fs::create_dir_all(conf.branches_path(&tenantid))?;
fs::create_dir_all(conf.tags_path(&tenantid))?;
crashsafe_dir::create_dir(conf.timelines_path(&tenantid))?;
crashsafe_dir::create_dir_all(conf.branches_path(&tenantid))?;
crashsafe_dir::create_dir_all(conf.tags_path(&tenantid))?;
info!("created directory structure in {}", repo_dir.display());
let tli = create_timeline(conf, None, &tenantid)?;
// create a new timeline directory
let timeline_id = ZTimelineId::generate();
let timelinedir = conf.timeline_path(&timeline_id, &tenantid);
crashsafe_dir::create_dir(&timelinedir)?;
let repo = Arc::new(crate::layered_repository::LayeredRepository::new(
conf,
wal_redo_manager,
tenantid,
conf.remote_storage_config.is_some(),
));
// Load data into pageserver
// TODO To implement zenith import we need to
// move data loading out of create_repo()
bootstrap_timeline(conf, tenantid, tli, &*repo)?;
bootstrap_timeline(conf, tenantid, timeline_id, repo.as_ref())?;
Ok(repo)
}
@@ -174,26 +181,29 @@ fn get_lsn_from_controlfile(path: &Path) -> Result<Lsn> {
// to get bootstrap data for timeline initialization.
//
fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
info!("running initdb... ");
info!("running initdb in {}... ", initdbpath.display());
let initdb_path = conf.pg_bin_dir().join("initdb");
let initdb_output = Command::new(initdb_path)
.args(&["-D", initdbpath.to_str().unwrap()])
.args(&["-U", &conf.superuser])
.args(&["-E", "utf8"])
.arg("--no-instructions")
// This is only used for a temporary installation that is deleted shortly after,
// so no need to fsync it
.arg("--no-sync")
.env_clear()
.env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.stdout(Stdio::null())
.output()
.with_context(|| "failed to execute initdb")?;
.context("failed to execute initdb")?;
if !initdb_output.status.success() {
anyhow::bail!(
"initdb failed: '{}'",
String::from_utf8_lossy(&initdb_output.stderr)
);
}
info!("initdb succeeded");
Ok(())
}
@@ -208,6 +218,8 @@ fn bootstrap_timeline(
tli: ZTimelineId,
repo: &dyn Repository,
) -> Result<()> {
let _enter = info_span!("bootstrapping", timeline = %tli, tenant = %tenantid).entered();
let initdb_path = conf.tenant_path(&tenantid).join("tmp");
// Init temporarily repo to get bootstrap data
@@ -216,13 +228,17 @@ fn bootstrap_timeline(
let lsn = get_lsn_from_controlfile(&pgdata_path)?.align();
info!("bootstrap_timeline {:?} at lsn {}", pgdata_path, lsn);
// Import the contents of the data directory at the initial checkpoint
// LSN, and any WAL after that.
let timeline = repo.create_empty_timeline(tli)?;
restore_local_repo::import_timeline_from_postgres_datadir(&pgdata_path, &*timeline, lsn)?;
timeline.checkpoint()?;
// Initdb lsn will be equal to last_record_lsn which will be set after import.
// Because we know it upfront avoid having an option or dummy zero value by passing it to create_empty_timeline.
let timeline = repo.create_empty_timeline(tli, lsn)?;
import_datadir::import_timeline_from_postgres_datadir(
&pgdata_path,
timeline.writer().as_ref(),
lsn,
)?;
timeline.checkpoint(CheckpointConfig::Forced)?;
println!(
"created initial timeline {} timeline.lsn {}",
@@ -240,29 +256,38 @@ fn bootstrap_timeline(
Ok(())
}
pub(crate) fn get_tenants(conf: &PageServerConf) -> Result<Vec<String>> {
let tenants_dir = conf.tenants_path();
std::fs::read_dir(&tenants_dir)?
.map(|dir_entry_res| {
let dir_entry = dir_entry_res?;
ensure!(dir_entry.file_type()?.is_dir());
Ok(dir_entry.file_name().to_str().unwrap().to_owned())
})
.collect()
}
pub(crate) fn get_branches(conf: &PageServerConf, tenantid: &ZTenantId) -> Result<Vec<BranchInfo>> {
pub(crate) fn get_branches(
conf: &PageServerConf,
tenantid: &ZTenantId,
include_non_incremental_logical_size: bool,
) -> Result<Vec<BranchInfo>> {
let repo = tenant_mgr::get_repository_for_tenant(*tenantid)?;
// Each branch has a corresponding record (text file) in the refs/branches
// with timeline_id.
let branches_dir = conf.branches_path(tenantid);
std::fs::read_dir(&branches_dir)?
std::fs::read_dir(&branches_dir)
.with_context(|| {
format!(
"Found no branches directory '{}' for tenant {}",
branches_dir.display(),
tenantid
)
})?
.map(|dir_entry_res| {
let dir_entry = dir_entry_res?;
BranchInfo::from_path(dir_entry.path(), conf, tenantid, &repo)
let dir_entry = dir_entry_res.with_context(|| {
format!(
"Failed to list branches directory '{}' content for tenant {}",
branches_dir.display(),
tenantid
)
})?;
BranchInfo::from_path(
dir_entry.path(),
&repo,
include_non_incremental_logical_size,
)
})
.collect()
}
@@ -280,7 +305,10 @@ pub(crate) fn create_branch(
}
let mut startpoint = parse_point_in_time(conf, startpoint_str, tenantid)?;
let timeline = repo.get_timeline(startpoint.timelineid)?;
let timeline = repo
.get_timeline(startpoint.timelineid)?
.local_timeline()
.context("Cannot branch off the timeline that's not present locally")?;
if startpoint.lsn == Lsn(0) {
// Find end of WAL on the old timeline
let end_of_wal = timeline.get_last_record_lsn();
@@ -296,35 +324,36 @@ pub(crate) fn create_branch(
timeline.wait_lsn(startpoint.lsn)?;
}
startpoint.lsn = startpoint.lsn.align();
if timeline.get_start_lsn() > startpoint.lsn {
if timeline.get_ancestor_lsn() > startpoint.lsn {
// can we safely just branch from the ancestor instead?
anyhow::bail!(
"invalid startpoint {} for the branch {}: less than timeline start {}",
"invalid startpoint {} for the branch {}: less than timeline ancestor lsn {:?}",
startpoint.lsn,
branchname,
timeline.get_start_lsn()
timeline.get_ancestor_lsn()
);
}
// create a new timeline directory for it
let newtli = create_timeline(conf, Some(startpoint), tenantid)?;
let new_timeline_id = ZTimelineId::generate();
// Let the Repository backend do its initialization
repo.branch_timeline(startpoint.timelineid, newtli, startpoint.lsn)?;
// Forward entire timeline creation routine to repository
// backend, so it can do all needed initialization
repo.branch_timeline(startpoint.timelineid, new_timeline_id, startpoint.lsn)?;
// Remember the human-readable branch name for the new timeline.
// FIXME: there's a race condition, if you create a branch with the same
// name concurrently.
let data = newtli.to_string();
let data = new_timeline_id.to_string();
fs::write(conf.branch_path(branchname, tenantid), data)?;
Ok(BranchInfo {
name: branchname.to_string(),
timeline_id: newtli,
timeline_id: new_timeline_id,
latest_valid_lsn: startpoint.lsn,
ancestor_id: None,
ancestor_lsn: None,
ancestor_id: Some(startpoint.timelineid.to_string()),
ancestor_lsn: Some(startpoint.lsn.to_string()),
current_logical_size: 0,
current_logical_size_non_incremental: 0,
current_logical_size_non_incremental: Some(0),
})
}
@@ -355,14 +384,11 @@ fn parse_point_in_time(
let mut strings = s.split('@');
let name = strings.next().unwrap();
let lsn: Option<Lsn>;
if let Some(lsnstr) = strings.next() {
lsn = Some(
Lsn::from_str(lsnstr).with_context(|| "invalid LSN in point-in-time specification")?,
);
} else {
lsn = None
}
let lsn = strings
.next()
.map(Lsn::from_str)
.transpose()
.context("invalid LSN in point-in-time specification")?;
// Check if it's a tag
if lsn.is_none() {
@@ -400,25 +426,3 @@ fn parse_point_in_time(
bail!("could not parse point-in-time {}", s);
}
fn create_timeline(
conf: &PageServerConf,
ancestor: Option<PointInTime>,
tenantid: &ZTenantId,
) -> Result<ZTimelineId> {
// Create initial timeline
let timelineid = ZTimelineId::generate();
let timelinedir = conf.timeline_path(&timelineid, tenantid);
fs::create_dir(&timelinedir)?;
fs::create_dir(&timelinedir.join("wal"))?;
if let Some(ancestor) = ancestor {
let data = format!("{}@{}", ancestor.timelineid, ancestor.lsn);
fs::write(timelinedir.join("ancestor"), data)?;
}
Ok(timelineid)
}

853
pageserver/src/config.rs Normal file
View File

@@ -0,0 +1,853 @@
//! Functions for handling page server configuration options
//!
//! Configuration options can be set in the pageserver.toml configuration
//! file, or on the command line.
//! See also `settings.md` for better description on every parameter.
use anyhow::{bail, ensure, Context, Result};
use toml_edit;
use toml_edit::{Document, Item};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
use std::convert::TryInto;
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::{Path, PathBuf};
use std::str::FromStr;
use std::time::Duration;
use crate::layered_repository::TIMELINES_SEGMENT_NAME;
pub mod defaults {
use const_format::formatcp;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}");
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
pub const DEFAULT_CHECKPOINT_PERIOD: &str = "1 s";
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: &str = "100 s";
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 100;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
///
/// Default built-in configuration file.
///
pub const DEFAULT_CONFIG_FILE: &str = formatcp!(
r###"
# Initial configuration file created by 'pageserver --init'
#listen_pg_addr = '{DEFAULT_PG_LISTEN_ADDR}'
#listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#checkpoint_period = '{DEFAULT_CHECKPOINT_PERIOD}'
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#max_file_descriptors = {DEFAULT_MAX_FILE_DESCRIPTORS}
# initial superuser role name to use when creating a new tenant
#initial_superuser_name = '{DEFAULT_SUPERUSER}'
# [remote_storage]
"###
);
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PageServerConf {
// Identifier of that particular pageserver so e g safekeepers
// can safely distinguish different pageservers
pub id: ZNodeId,
/// Example (default): 127.0.0.1:64000
pub listen_pg_addr: String,
/// Example (default): 127.0.0.1:9898
pub listen_http_addr: String,
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
pub checkpoint_distance: u64,
pub checkpoint_period: Duration,
pub gc_horizon: u64,
pub gc_period: Duration,
pub superuser: String,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
// Repository directory, relative to current working directory.
// Normally, the page server changes the current working directory
// to the repository, and 'workdir' is always '.'. But we don't do
// that during unit testing, because the current directory is global
// to the process but different unit tests work on different
// repositories.
pub workdir: PathBuf,
pub pg_distrib_dir: PathBuf,
pub auth_type: AuthType,
pub auth_validation_public_key_path: Option<PathBuf>,
pub remote_storage_config: Option<RemoteStorageConfig>,
}
// use dedicated enum for builder to better indicate the intention
// and avoid possible confusion with nested options
pub enum BuilderValue<T> {
Set(T),
NotSet,
}
impl<T> BuilderValue<T> {
pub fn ok_or<E>(self, err: E) -> Result<T, E> {
match self {
Self::Set(v) => Ok(v),
Self::NotSet => Err(err),
}
}
}
// needed to simplify config construction
struct PageServerConfigBuilder {
listen_pg_addr: BuilderValue<String>,
listen_http_addr: BuilderValue<String>,
checkpoint_distance: BuilderValue<u64>,
checkpoint_period: BuilderValue<Duration>,
gc_horizon: BuilderValue<u64>,
gc_period: BuilderValue<Duration>,
superuser: BuilderValue<String>,
page_cache_size: BuilderValue<usize>,
max_file_descriptors: BuilderValue<usize>,
workdir: BuilderValue<PathBuf>,
pg_distrib_dir: BuilderValue<PathBuf>,
auth_type: BuilderValue<AuthType>,
//
auth_validation_public_key_path: BuilderValue<Option<PathBuf>>,
remote_storage_config: BuilderValue<Option<RemoteStorageConfig>>,
id: BuilderValue<ZNodeId>,
}
impl Default for PageServerConfigBuilder {
fn default() -> Self {
use self::BuilderValue::*;
use defaults::*;
Self {
listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()),
listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()),
checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE),
checkpoint_period: Set(humantime::parse_duration(DEFAULT_CHECKPOINT_PERIOD)
.expect("cannot parse default checkpoint period")),
gc_horizon: Set(DEFAULT_GC_HORIZON),
gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD)
.expect("cannot parse default gc period")),
superuser: Set(DEFAULT_SUPERUSER.to_string()),
page_cache_size: Set(DEFAULT_PAGE_CACHE_SIZE),
max_file_descriptors: Set(DEFAULT_MAX_FILE_DESCRIPTORS),
workdir: Set(PathBuf::new()),
pg_distrib_dir: Set(env::current_dir()
.expect("cannot access current directory")
.join("tmp_install")),
auth_type: Set(AuthType::Trust),
auth_validation_public_key_path: Set(None),
remote_storage_config: Set(None),
id: NotSet,
}
}
}
impl PageServerConfigBuilder {
pub fn listen_pg_addr(&mut self, listen_pg_addr: String) {
self.listen_pg_addr = BuilderValue::Set(listen_pg_addr)
}
pub fn listen_http_addr(&mut self, listen_http_addr: String) {
self.listen_http_addr = BuilderValue::Set(listen_http_addr)
}
pub fn checkpoint_distance(&mut self, checkpoint_distance: u64) {
self.checkpoint_distance = BuilderValue::Set(checkpoint_distance)
}
pub fn checkpoint_period(&mut self, checkpoint_period: Duration) {
self.checkpoint_period = BuilderValue::Set(checkpoint_period)
}
pub fn gc_horizon(&mut self, gc_horizon: u64) {
self.gc_horizon = BuilderValue::Set(gc_horizon)
}
pub fn gc_period(&mut self, gc_period: Duration) {
self.gc_period = BuilderValue::Set(gc_period)
}
pub fn superuser(&mut self, superuser: String) {
self.superuser = BuilderValue::Set(superuser)
}
pub fn page_cache_size(&mut self, page_cache_size: usize) {
self.page_cache_size = BuilderValue::Set(page_cache_size)
}
pub fn max_file_descriptors(&mut self, max_file_descriptors: usize) {
self.max_file_descriptors = BuilderValue::Set(max_file_descriptors)
}
pub fn workdir(&mut self, workdir: PathBuf) {
self.workdir = BuilderValue::Set(workdir)
}
pub fn pg_distrib_dir(&mut self, pg_distrib_dir: PathBuf) {
self.pg_distrib_dir = BuilderValue::Set(pg_distrib_dir)
}
pub fn auth_type(&mut self, auth_type: AuthType) {
self.auth_type = BuilderValue::Set(auth_type)
}
pub fn auth_validation_public_key_path(
&mut self,
auth_validation_public_key_path: Option<PathBuf>,
) {
self.auth_validation_public_key_path = BuilderValue::Set(auth_validation_public_key_path)
}
pub fn remote_storage_config(&mut self, remote_storage_config: Option<RemoteStorageConfig>) {
self.remote_storage_config = BuilderValue::Set(remote_storage_config)
}
pub fn id(&mut self, node_id: ZNodeId) {
self.id = BuilderValue::Set(node_id)
}
pub fn build(self) -> Result<PageServerConf> {
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
.ok_or(anyhow::anyhow!("missing listen_pg_addr"))?,
listen_http_addr: self
.listen_http_addr
.ok_or(anyhow::anyhow!("missing listen_http_addr"))?,
checkpoint_distance: self
.checkpoint_distance
.ok_or(anyhow::anyhow!("missing checkpoint_distance"))?,
checkpoint_period: self
.checkpoint_period
.ok_or(anyhow::anyhow!("missing checkpoint_period"))?,
gc_horizon: self
.gc_horizon
.ok_or(anyhow::anyhow!("missing gc_horizon"))?,
gc_period: self.gc_period.ok_or(anyhow::anyhow!("missing gc_period"))?,
superuser: self.superuser.ok_or(anyhow::anyhow!("missing superuser"))?,
page_cache_size: self
.page_cache_size
.ok_or(anyhow::anyhow!("missing page_cache_size"))?,
max_file_descriptors: self
.max_file_descriptors
.ok_or(anyhow::anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow::anyhow!("missing workdir"))?,
pg_distrib_dir: self
.pg_distrib_dir
.ok_or(anyhow::anyhow!("missing pg_distrib_dir"))?,
auth_type: self.auth_type.ok_or(anyhow::anyhow!("missing auth_type"))?,
auth_validation_public_key_path: self
.auth_validation_public_key_path
.ok_or(anyhow::anyhow!("missing auth_validation_public_key_path"))?,
remote_storage_config: self
.remote_storage_config
.ok_or(anyhow::anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow::anyhow!("missing id"))?,
})
}
}
/// External backup storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between pageserver and the remote storage.
pub max_concurrent_sync: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
pub storage: RemoteStorageKind,
}
/// A kind of a remote storage to connect to, with its connection configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored relish data into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all relishes into the root
/// of the S3 bucket from the config.
AwsS3(S3Config),
}
/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write).
#[derive(Clone, PartialEq, Eq)]
pub struct S3Config {
/// Name of the bucket to connect to.
pub bucket_name: String,
/// The region where the bucket is located at.
pub bucket_region: String,
/// A "subfolder" in the bucket, to use the same bucket separately by multiple pageservers at once.
pub prefix_in_bucket: Option<String>,
/// "Login" to use when connecting to bucket.
/// Can be empty for cases like AWS k8s IAM
/// where we can allow certain pods to connect
/// to the bucket directly without any credentials.
pub access_key_id: Option<String>,
/// "Password" to use when connecting to bucket.
pub secret_access_key: Option<String>,
/// A base URL to send S3 requests to.
/// By default, the endpoint is derived from a region name, assuming it's
/// an AWS S3 region name, erroring on wrong region name.
/// Endpoint provides a way to support other S3 flavors and their regions.
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,
}
impl std::fmt::Debug for S3Config {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("S3Config")
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.field("prefix_in_bucket", &self.prefix_in_bucket)
.finish()
}
}
impl PageServerConf {
//
// Repository paths, relative to workdir.
//
pub fn tenants_path(&self) -> PathBuf {
self.workdir.join("tenants")
}
pub fn tenant_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenants_path().join(tenantid.to_string())
}
pub fn tags_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("refs").join("tags")
}
pub fn tag_path(&self, tag_name: &str, tenantid: &ZTenantId) -> PathBuf {
self.tags_path(tenantid).join(tag_name)
}
pub fn branches_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("refs").join("branches")
}
pub fn branch_path(&self, branch_name: &str, tenantid: &ZTenantId) -> PathBuf {
self.branches_path(tenantid).join(branch_name)
}
pub fn timelines_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join(TIMELINES_SEGMENT_NAME)
}
pub fn timeline_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timelines_path(tenantid).join(timelineid.to_string())
}
pub fn ancestor_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timeline_path(timelineid, tenantid).join("ancestor")
}
//
// Postgres distribution paths
//
pub fn pg_bin_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("bin")
}
pub fn pg_lib_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("lib")
}
/// Parse a configuration file (pageserver.toml) into a PageServerConf struct,
/// validating the input and failing on errors.
///
/// This leaves any options not present in the file in the built-in defaults.
pub fn parse_and_validate(toml: &Document, workdir: &Path) -> Result<Self> {
let mut builder = PageServerConfigBuilder::default();
builder.workdir(workdir.to_owned());
for (key, item) in toml.iter() {
match key {
"listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?),
"listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?),
"checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?),
"checkpoint_period" => builder.checkpoint_period(parse_toml_duration(key, item)?),
"gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?),
"gc_period" => builder.gc_period(parse_toml_duration(key, item)?),
"initial_superuser_name" => builder.superuser(parse_toml_string(key, item)?),
"page_cache_size" => builder.page_cache_size(parse_toml_u64(key, item)? as usize),
"max_file_descriptors" => {
builder.max_file_descriptors(parse_toml_u64(key, item)? as usize)
}
"pg_distrib_dir" => {
builder.pg_distrib_dir(PathBuf::from(parse_toml_string(key, item)?))
}
"auth_validation_public_key_path" => builder.auth_validation_public_key_path(Some(
PathBuf::from(parse_toml_string(key, item)?),
)),
"auth_type" => builder.auth_type(parse_toml_auth_type(key, item)?),
"remote_storage" => {
builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?))
}
"id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)),
_ => bail!("unrecognized pageserver option '{}'", key),
}
}
let mut conf = builder.build().context("invalid config")?;
if conf.auth_type == AuthType::ZenithJWT {
let auth_validation_public_key_path = conf
.auth_validation_public_key_path
.get_or_insert_with(|| workdir.join("auth_public_key.pem"));
ensure!(
auth_validation_public_key_path.exists(),
format!(
"Can't find auth_validation_public_key at '{}'",
auth_validation_public_key_path.display()
)
);
}
if !conf.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
conf.pg_distrib_dir.display()
);
}
Ok(conf)
}
/// subroutine of parse_config(), to parse the `[remote_storage]` table.
fn parse_remote_storage_config(toml: &toml_edit::Item) -> anyhow::Result<RemoteStorageConfig> {
let local_path = toml.get("local_path");
let bucket_name = toml.get("bucket_name");
let bucket_region = toml.get("bucket_region");
let max_concurrent_sync: NonZeroUsize = if let Some(s) = toml.get("max_concurrent_sync") {
parse_toml_u64("max_concurrent_sync", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_concurrent_sync' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroUsize::new)
.context("'max_concurrent_sync' must be a non-zero positive integer")?
} else {
NonZeroUsize::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC).unwrap()
};
let max_sync_errors: NonZeroU32 = if let Some(s) = toml.get("max_sync_errors") {
parse_toml_u64("max_sync_errors", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_sync_errors' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroU32::new)
.context("'max_sync_errors' must be a non-zero positive integer")?
} else {
NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS).unwrap()
};
let storage = match (local_path, bucket_name, bucket_region) {
(None, None, None) => bail!("no 'local_path' nor 'bucket_name' option"),
(_, Some(_), None) => {
bail!("'bucket_region' option is mandatory if 'bucket_name' is given ")
}
(_, None, Some(_)) => {
bail!("'bucket_name' option is mandatory if 'bucket_region' is given ")
}
(None, Some(bucket_name), Some(bucket_region)) => RemoteStorageKind::AwsS3(S3Config {
bucket_name: parse_toml_string("bucket_name", bucket_name)?,
bucket_region: parse_toml_string("bucket_region", bucket_region)?,
access_key_id: toml
.get("access_key_id")
.map(|access_key_id| parse_toml_string("access_key_id", access_key_id))
.transpose()?,
secret_access_key: toml
.get("secret_access_key")
.map(|secret_access_key| {
parse_toml_string("secret_access_key", secret_access_key)
})
.transpose()?,
prefix_in_bucket: toml
.get("prefix_in_bucket")
.map(|prefix_in_bucket| parse_toml_string("prefix_in_bucket", prefix_in_bucket))
.transpose()?,
endpoint: toml
.get("endpoint")
.map(|endpoint| parse_toml_string("endpoint", endpoint))
.transpose()?,
}),
(Some(local_path), None, None) => RemoteStorageKind::LocalFs(PathBuf::from(
parse_toml_string("local_path", local_path)?,
)),
(Some(_), Some(_), _) => bail!("local_path and bucket_name are mutually exclusive"),
};
Ok(RemoteStorageConfig {
max_concurrent_sync,
max_sync_errors,
storage,
})
}
#[cfg(test)]
pub fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{}", test_name))
}
#[cfg(test)]
pub fn dummy_conf(repo_dir: PathBuf) -> Self {
PageServerConf {
id: ZNodeId(0),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: Duration::from_secs(10),
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE,
max_file_descriptors: defaults::DEFAULT_MAX_FILE_DESCRIPTORS,
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
superuser: "zenith_admin".to_string(),
workdir: repo_dir,
pg_distrib_dir: PathBuf::new(),
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
}
}
}
// Helper functions to parse a toml Item
fn parse_toml_string(name: &str, item: &Item) -> Result<String> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
Ok(s.to_string())
}
fn parse_toml_u64(name: &str, item: &Item) -> Result<u64> {
// A toml integer is signed, so it cannot represent the full range of an u64. That's OK
// for our use, though.
let i: i64 = item
.as_integer()
.with_context(|| format!("configure option {} is not an integer", name))?;
if i < 0 {
bail!("configure option {} cannot be negative", name);
}
Ok(i as u64)
}
fn parse_toml_duration(name: &str, item: &Item) -> Result<Duration> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
Ok(humantime::parse_duration(s)?)
}
fn parse_toml_auth_type(name: &str, item: &Item) -> Result<AuthType> {
let v = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
AuthType::from_str(v)
}
#[cfg(test)]
mod tests {
use std::fs;
use tempfile::{tempdir, TempDir};
use super::*;
const ALL_BASE_VALUES_TOML: &str = r#"
# Initial configuration file created by 'pageserver --init'
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
checkpoint_distance = 111 # in bytes
checkpoint_period = '111 s'
gc_period = '222 s'
gc_horizon = 222
page_cache_size = 444
max_file_descriptors = 333
# initial superuser role name to use when creating a new tenant
initial_superuser_name = 'zzzz'
id = 10
"#;
#[test]
fn parse_defaults() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
// we have to create dummy pathes to overcome the validation errors
let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display());
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
assert_eq!(
parsed_config,
PageServerConf {
id: ZNodeId(10),
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: humantime::parse_duration(defaults::DEFAULT_CHECKPOINT_PERIOD)?,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?,
superuser: defaults::DEFAULT_SUPERUSER.to_string(),
page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE,
max_file_descriptors: defaults::DEFAULT_MAX_FILE_DESCRIPTORS,
workdir,
pg_distrib_dir,
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
},
"Correct defaults should be used when no config values are provided"
);
Ok(())
}
#[test]
fn parse_basic_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let config_string = format!(
"{}pg_distrib_dir='{}'",
ALL_BASE_VALUES_TOML,
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
assert_eq!(
parsed_config,
PageServerConf {
id: ZNodeId(10),
listen_pg_addr: "127.0.0.1:64000".to_string(),
listen_http_addr: "127.0.0.1:9898".to_string(),
checkpoint_distance: 111,
checkpoint_period: Duration::from_secs(111),
gc_horizon: 222,
gc_period: Duration::from_secs(222),
superuser: "zzzz".to_string(),
page_cache_size: 444,
max_file_descriptors: 333,
workdir,
pg_distrib_dir,
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
},
"Should be able to parse all basic config values correctly"
);
Ok(())
}
#[test]
fn parse_remote_fs_storage_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let local_storage_path = tempdir.path().join("local_remote_storage");
let identical_toml_declarations = &[
format!(
r#"[remote_storage]
local_path = '{}'"#,
local_storage_path.display()
),
format!(
"remote_storage={{local_path='{}'}}",
local_storage_path.display()
),
];
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.remote_storage_config
.expect("Should have remote storage config for the local FS");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC
)
.unwrap(),
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
}
Ok(())
}
#[test]
fn parse_remote_s3_storage_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let bucket_name = "some-sample-bucket".to_string();
let bucket_region = "eu-north-1".to_string();
let prefix_in_bucket = "test_prefix".to_string();
let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string();
let secret_access_key = "SOMEsEcReTsd292v".to_string();
let endpoint = "http://localhost:5000".to_string();
let max_concurrent_sync = NonZeroUsize::new(111).unwrap();
let max_sync_errors = NonZeroU32::new(222).unwrap();
let identical_toml_declarations = &[
format!(
r#"[remote_storage]
max_concurrent_sync = {}
max_sync_errors = {}
bucket_name = '{}'
bucket_region = '{}'
prefix_in_bucket = '{}'
access_key_id = '{}'
secret_access_key = '{}'
endpoint = '{}'"#,
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
),
format!(
"remote_storage={{max_concurrent_sync={}, max_sync_errors={}, bucket_name='{}', bucket_region='{}', prefix_in_bucket='{}', access_key_id='{}', secret_access_key='{}', endpoint='{}'}}",
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
),
];
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.remote_storage_config
.expect("Should have remote storage config for S3");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync,
max_sync_errors,
storage: RemoteStorageKind::AwsS3(S3Config {
bucket_name: bucket_name.clone(),
bucket_region: bucket_region.clone(),
access_key_id: Some(access_key_id.clone()),
secret_access_key: Some(secret_access_key.clone()),
prefix_in_bucket: Some(prefix_in_bucket.clone()),
endpoint: Some(endpoint.clone())
}),
},
"Remote storage config should correctly parse the S3 config"
);
}
Ok(())
}
fn prepare_fs(tempdir: &TempDir) -> anyhow::Result<(PathBuf, PathBuf)> {
let tempdir_path = tempdir.path();
let workdir = tempdir_path.join("workdir");
fs::create_dir_all(&workdir)?;
let pg_distrib_dir = tempdir_path.join("pg_distrib");
fs::create_dir_all(&pg_distrib_dir)?;
let postgres_bin_dir = pg_distrib_dir.join("bin");
fs::create_dir_all(&postgres_bin_dir)?;
fs::write(postgres_bin_dir.join("postgres"), "I'm postgres, trust me")?;
Ok((workdir, pg_distrib_dir))
}
}

View File

@@ -1,6 +1,7 @@
use serde::{Deserialize, Serialize};
use crate::ZTenantId;
use zenith_utils::zid::ZNodeId;
#[derive(Serialize, Deserialize)]
pub struct BranchCreateRequest {
@@ -15,3 +16,8 @@ pub struct TenantCreateRequest {
#[serde(with = "hex")]
pub tenant_id: ZTenantId,
}
#[derive(Serialize)]
pub struct StatusResponse {
pub id: ZNodeId,
}

View File

@@ -17,6 +17,103 @@ paths:
application/json:
schema:
type: object
required:
- id
properties:
id:
type: integer
/v1/timeline/{tenant_id}:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
get:
description: List tenant timelines
responses:
"200":
description: array of brief timeline descriptions
content:
application/json:
schema:
type: array
items:
# currently, just a timeline id string, but when remote index gets to be accessed
# remote/local timeline field would be added at least
type: string
"400":
description: Error when no tenant id found in path
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/timeline/{tenant_id}/{timeline_id}:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: timeline_id
in: path
required: true
schema:
type: string
format: hex
get:
description: Get timeline info for tenant's remote timeline
responses:
"200":
description: TimelineInfo
content:
application/json:
schema:
$ref: "#/components/schemas/TimelineInfo"
"400":
description: Error when no tenant id found in path or no branch name
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/branch/{tenant_id}:
parameters:
- name: tenant_id
@@ -25,6 +122,11 @@ paths:
schema:
type: string
format: hex
- name: include-non-incremental-logical-size
in: query
schema:
type: string
description: Controls calculation of current_logical_size_non_incremental
get:
description: Get branches for tenant
responses:
@@ -73,6 +175,11 @@ paths:
required: true
schema:
type: string
- name: include-non-incremental-logical-size
in: query
schema:
type: string
description: Controls calculation of current_logical_size_non_incremental
get:
description: Get branches for tenant
responses:
@@ -132,9 +239,7 @@ paths:
content:
application/json:
schema:
type: array
items:
$ref: "#/components/schemas/BranchInfo"
$ref: "#/components/schemas/BranchInfo"
"400":
description: Malformed branch create request
content:
@@ -164,13 +269,13 @@ paths:
description: Get tenants list
responses:
"200":
description: OK
description: TenantInfo
content:
application/json:
schema:
type: array
items:
type: string
$ref: "#/components/schemas/TenantInfo"
"401":
description: Unauthorized Error
content:
@@ -243,6 +348,16 @@ components:
scheme: bearer
bearerFormat: JWT
schemas:
TenantInfo:
type: object
required:
- id
- state
properties:
id:
type: string
state:
type: string
BranchInfo:
type: object
required:
@@ -250,7 +365,6 @@ components:
- timeline_id
- latest_valid_lsn
- current_logical_size
- current_logical_size_non_incremental
properties:
name:
type: string
@@ -259,12 +373,45 @@ components:
format: hex
ancestor_id:
type: string
format: hex
ancestor_lsn:
type: string
current_logical_size:
type: integer
current_logical_size_non_incremental:
type: integer
latest_valid_lsn:
type: integer
TimelineInfo:
type: object
required:
- timeline_id
- tenant_id
- last_record_lsn
- prev_record_lsn
- start_lsn
- disk_consistent_lsn
properties:
timeline_id:
type: string
format: hex
tenant_id:
type: string
format: hex
ancestor_timeline_id:
type: string
format: hex
last_record_lsn:
type: string
prev_record_lsn:
type: string
start_lsn:
type: string
disk_consistent_lsn:
type: string
timeline_state:
type: string
Error:
type: object
required:

View File

@@ -1,11 +1,10 @@
use std::str::FromStr;
use std::sync::Arc;
use anyhow::Result;
use hyper::header;
use anyhow::{Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use routerify::{ext::RequestExt, RouterBuilder};
use serde::Serialize;
use tracing::*;
use zenith_utils::auth::JwtAuth;
use zenith_utils::http::endpoint::attach_openapi_ui;
use zenith_utils::http::endpoint::auth_middleware;
@@ -15,12 +14,20 @@ use zenith_utils::http::{
endpoint,
error::HttpErrorBody,
json::{json_request, json_response},
request::get_request_param,
request::parse_request_param,
};
use zenith_utils::http::{RequestExt, RouterBuilder};
use zenith_utils::lsn::Lsn;
use zenith_utils::zid::{opt_display_serde, ZTimelineId};
use super::models::BranchCreateRequest;
use super::models::StatusResponse;
use super::models::TenantCreateRequest;
use crate::branches::BranchInfo;
use crate::{branches, tenant_mgr, PageServerConf, ZTenantId};
use crate::repository::RepositoryTimeline;
use crate::repository::TimelineSyncState;
use crate::{branches, config::PageServerConf, tenant_mgr, ZTenantId};
#[derive(Debug)]
struct State {
@@ -56,40 +63,13 @@ fn get_config(request: &Request<Body>) -> &'static PageServerConf {
get_state(request).conf
}
fn get_request_param<'a>(
request: &'a Request<Body>,
param_name: &str,
) -> Result<&'a str, ApiError> {
match request.param(param_name) {
Some(arg) => Ok(arg),
None => {
return Err(ApiError::BadRequest(format!(
"no {} specified in path param",
param_name
)))
}
}
}
fn parse_request_param<T: FromStr>(
request: &Request<Body>,
param_name: &str,
) -> Result<T, ApiError> {
match get_request_param(request, param_name)?.parse() {
Ok(v) => Ok(v),
Err(_) => Err(ApiError::BadRequest(
"failed to parse tenant id".to_string(),
)),
}
}
// healthcheck handler
async fn status_handler(_: Request<Body>) -> Result<Response<Body>, ApiError> {
Ok(Response::builder()
.status(StatusCode::OK)
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from("{}"))
.map_err(ApiError::from_err)?)
async fn status_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let config = get_config(&request);
Ok(json_response(
StatusCode::OK,
StatusResponse { id: config.id },
)?)
}
async fn branch_create_handler(mut request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -98,6 +78,7 @@ async fn branch_create_handler(mut request: Request<Body>) -> Result<Response<Bo
check_permission(&request, Some(request_data.tenant_id))?;
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("/branch_create", name = %request_data.name, tenant = %request_data.tenant_id, startpoint=%request_data.start_point).entered();
branches::create_branch(
get_config(&request),
&request_data.name,
@@ -110,29 +91,53 @@ async fn branch_create_handler(mut request: Request<Body>) -> Result<Response<Bo
Ok(json_response(StatusCode::CREATED, response_data)?)
}
// Gate non incremental logical size calculation behind a flag
// after pgbench -i -s100 calculation took 28ms so if multiplied by the number of timelines
// and tenants it can take noticeable amount of time. Also the value currently used only in tests
fn get_include_non_incremental_logical_size(request: &Request<Body>) -> bool {
request
.uri()
.query()
.map(|v| {
url::form_urlencoded::parse(v.as_bytes())
.into_owned()
.any(|(param, _)| param == "include-non-incremental-logical-size")
})
.unwrap_or(false)
}
async fn branch_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenantid: ZTenantId = parse_request_param(&request, "tenant_id")?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
check_permission(&request, Some(tenantid))?;
let response_data = tokio::task::spawn_blocking(move || {
crate::branches::get_branches(get_config(&request), &tenantid)
let _enter = info_span!("branch_list", tenant = %tenantid).entered();
crate::branches::get_branches(
get_config(&request),
&tenantid,
include_non_incremental_logical_size,
)
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
// TODO add to swagger
async fn branch_detail_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenantid: ZTenantId = parse_request_param(&request, "tenant_id")?;
let branch_name: &str = get_request_param(&request, "branch_name")?;
let branch_name: String = get_request_param(&request, "branch_name")?.to_string();
let conf = get_state(&request).conf;
let path = conf.branch_path(branch_name, &tenantid);
let path = conf.branch_path(&branch_name, &tenantid);
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("branch_detail", tenant = %tenantid, branch=%branch_name).entered();
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
BranchInfo::from_path(path, conf, &tenantid, &repo)
BranchInfo::from_path(path, &repo, include_non_incremental_logical_size)
})
.await
.map_err(ApiError::from_err)??;
@@ -140,14 +145,170 @@ async fn branch_detail_handler(request: Request<Body>) -> Result<Response<Body>,
Ok(json_response(StatusCode::OK, response_data)?)
}
async fn timeline_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let conf = get_state(&request).conf;
let timelines_dir = conf.timelines_path(&tenant_id);
let mut timelines_dir_contents =
tokio::fs::read_dir(&timelines_dir).await.with_context(|| {
format!(
"Failed to list timelines dir '{}' contents",
timelines_dir.display()
)
})?;
let mut local_timelines = Vec::new();
while let Some(entry) = timelines_dir_contents.next_entry().await.with_context(|| {
format!(
"Failed to list timelines dir '{}' contents",
timelines_dir.display()
)
})? {
let entry_path = entry.path();
let entry_type = entry.file_type().await.with_context(|| {
format!(
"Failed to get file type of timeline dirs' entry '{}'",
entry_path.display()
)
})?;
if entry_type.is_dir() {
match entry.file_name().to_string_lossy().parse::<ZTimelineId>() {
Ok(timeline_id) => local_timelines.push(timeline_id.to_string()),
Err(e) => error!(
"Failed to get parse timeline id from timeline dirs' entry '{}': {}",
entry_path.display(),
e
),
}
}
}
Ok(json_response(StatusCode::OK, local_timelines)?)
}
#[derive(Debug, Serialize)]
#[serde(tag = "type")]
enum TimelineInfo {
Local {
#[serde(with = "hex")]
timeline_id: ZTimelineId,
#[serde(with = "hex")]
tenant_id: ZTenantId,
#[serde(with = "opt_display_serde")]
ancestor_timeline_id: Option<ZTimelineId>,
last_record_lsn: Lsn,
prev_record_lsn: Lsn,
disk_consistent_lsn: Lsn,
timeline_state: Option<TimelineSyncState>,
},
Remote {
#[serde(with = "hex")]
timeline_id: ZTimelineId,
#[serde(with = "hex")]
tenant_id: ZTenantId,
},
}
async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let response_data = tokio::task::spawn_blocking(move || {
let _enter =
info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
Ok::<_, anyhow::Error>(match repo.get_timeline(timeline_id)?.local_timeline() {
None => TimelineInfo::Remote {
timeline_id,
tenant_id,
},
Some(timeline) => TimelineInfo::Local {
timeline_id,
tenant_id,
ancestor_timeline_id: timeline.get_ancestor_timeline_id(),
disk_consistent_lsn: timeline.get_disk_consistent_lsn(),
last_record_lsn: timeline.get_last_record_lsn(),
prev_record_lsn: timeline.get_prev_record_lsn(),
timeline_state: repo.get_timeline_state(timeline_id),
},
})
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
tokio::task::spawn_blocking(move || {
let _enter =
info_span!("timeline_attach_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
match repo.get_timeline(timeline_id)? {
RepositoryTimeline::Local(_) => {
anyhow::bail!("Timeline with id {} is already local", timeline_id)
}
RepositoryTimeline::Remote {
id: _,
disk_consistent_lsn: _,
} => {
// FIXME (rodionov) get timeline already schedules timeline for download, and duplicate tasks can cause errors
// first should be fixed in https://github.com/zenithdb/zenith/issues/997
// TODO (rodionov) change timeline state to awaits download (incapsulate it somewhere in the repo)
// TODO (rodionov) can we safely request replication on the timeline before sync is completed? (can be implemented on top of the #997)
Ok(())
}
}
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::ACCEPTED, ())?)
}
async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
tokio::task::spawn_blocking(move || {
let _enter =
info_span!("timeline_detach_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
repo.detach_timeline(timeline_id)
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, ())?)
}
async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
// check for management permission
check_permission(&request, None)?;
let response_data =
tokio::task::spawn_blocking(move || crate::branches::get_tenants(get_config(&request)))
.await
.map_err(ApiError::from_err)??;
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_list").entered();
crate::tenant_mgr::list_tenants()
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
@@ -157,12 +318,13 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
let request_data: TenantCreateRequest = json_request(&mut request).await?;
let response_data = tokio::task::spawn_blocking(move || {
tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_create", tenant = %request_data.tenant_id).entered();
tenant_mgr::create_repository_for_tenant(get_config(&request), request_data.tenant_id)
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::CREATED, response_data)?)
Ok(json_response(StatusCode::CREATED, ())?)
}
async fn handler_404(_: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -192,6 +354,19 @@ pub fn make_router(
router
.data(Arc::new(State::new(conf, auth)))
.get("/v1/status", status_handler)
.get("/v1/timeline/:tenant_id", timeline_list_handler)
.get(
"/v1/timeline/:tenant_id/:timeline_id",
timeline_detail_handler,
)
.post(
"/v1/timeline/:tenant_id/:timeline_id/attach",
timeline_attach_handler,
)
.post(
"/v1/timeline/:tenant_id/:timeline_id/detach",
timeline_detach_handler,
)
.get("/v1/branch/:tenant_id", branch_list_handler)
.get("/v1/branch/:tenant_id/:branch_name", branch_detail_handler)
.post("/v1/branch", branch_create_handler)

View File

@@ -0,0 +1,380 @@
//!
//! Import data and WAL from a PostgreSQL data directory and WAL segments into
//! a zenith Timeline.
//!
use std::fs;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use tracing::*;
use crate::relish::*;
use crate::repository::*;
use crate::walingest::WalIngest;
use postgres_ffi::relfile_utils::*;
use postgres_ffi::waldecoder::*;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::Oid;
use postgres_ffi::{pg_constants, ControlFileData, DBState_DB_SHUTDOWNED};
use zenith_utils::lsn::Lsn;
///
/// Import all relation data pages from local disk into the repository.
///
/// This is currently only used to import a cluster freshly created by initdb.
/// The code that deals with the checkpoint would not work right if the
/// cluster was not shut down cleanly.
pub fn import_timeline_from_postgres_datadir(
path: &Path,
writer: &dyn TimelineWriter,
lsn: Lsn,
) -> Result<()> {
let mut pg_control: Option<ControlFileData> = None;
// Scan 'global'
for direntry in fs::read_dir(path.join("global"))? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
Some("pg_control") => {
pg_control = Some(import_control_file(writer, lsn, &direntry.path())?);
}
Some("pg_filenode.map") => import_nonrel_file(
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::GLOBALTABLESPACE_OID,
dbnode: 0,
},
&direntry.path(),
)?,
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
writer,
lsn,
pg_constants::GLOBALTABLESPACE_OID,
0,
)?,
}
}
// Scan 'base'. It contains database dirs, the database OID is the filename.
// E.g. 'base/12345', where 12345 is the database OID.
for direntry in fs::read_dir(path.join("base"))? {
let direntry = direntry?;
//skip all temporary files
if direntry.file_name().to_str().unwrap() == "pgsql_tmp" {
continue;
}
let dboid = direntry.file_name().to_str().unwrap().parse::<u32>()?;
for direntry in fs::read_dir(direntry.path())? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
Some("PG_VERSION") => continue,
Some("pg_filenode.map") => import_nonrel_file(
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::DEFAULTTABLESPACE_OID,
dbnode: dboid,
},
&direntry.path(),
)?,
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
writer,
lsn,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
)?,
}
}
}
for entry in fs::read_dir(path.join("pg_xact"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::Clog, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("members"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::MultiXactMembers, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("offsets"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::MultiXactOffsets, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_twophase"))? {
let entry = entry?;
let xid = u32::from_str_radix(entry.path().to_str().unwrap(), 16)?;
import_nonrel_file(writer, lsn, RelishTag::TwoPhase { xid }, &entry.path())?;
}
// TODO: Scan pg_tblspc
// We're done importing all the data files.
writer.advance_last_record_lsn(lsn);
// We expect the Postgres server to be shut down cleanly.
let pg_control = pg_control.context("pg_control file not found")?;
ensure!(
pg_control.state == DBState_DB_SHUTDOWNED,
"Postgres cluster was not shut down cleanly"
);
ensure!(
pg_control.checkPointCopy.redo == lsn.0,
"unexpected checkpoint REDO pointer"
);
// Import WAL. This is needed even when starting from a shutdown checkpoint, because
// this reads the checkpoint record itself, advancing the tip of the timeline to
// *after* the checkpoint record. And crucially, it initializes the 'prev_lsn'.
import_wal(
&path.join("pg_wal"),
writer,
Lsn(pg_control.checkPointCopy.redo),
lsn,
)?;
Ok(())
}
// subroutine of import_timeline_from_postgres_datadir(), to load one relation file.
fn import_relfile(
path: &Path,
timeline: &dyn TimelineWriter,
lsn: Lsn,
spcoid: Oid,
dboid: Oid,
) -> Result<()> {
// Does it look like a relation file?
trace!("importing rel file {}", path.display());
let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap());
if let Err(e) = p {
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
return Err(e.into());
}
let (relnode, forknum, segno) = p.unwrap();
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let mut blknum: u32 = segno * (1024 * 1024 * 1024 / pg_constants::BLCKSZ as u32);
loop {
let r = file.read_exact(&mut buf);
match r {
Ok(_) => {
let rel = RelTag {
spcnode: spcoid,
dbnode: dboid,
relnode,
forknum,
};
let tag = RelishTag::Relation(rel);
timeline.put_page_image(tag, blknum, lsn, Bytes::copy_from_slice(&buf))?;
}
// TODO: UnexpectedEof is expected
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
break;
}
_ => {
bail!("error reading file {}: {:#}", path.display(), err);
}
},
};
blknum += 1;
}
Ok(())
}
///
/// Import a "non-blocky" file into the repository
///
/// This is used for small files like the control file, twophase files etc. that
/// are just slurped into the repository as one blob.
///
fn import_nonrel_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
tag: RelishTag,
path: &Path,
) -> Result<()> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing non-rel file {}", path.display());
timeline.put_page_image(tag, 0, lsn, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
///
/// Import pg_control file into the repository.
///
/// The control file is imported as is, but we also extract the checkpoint record
/// from it and store it separated.
fn import_control_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
path: &Path,
) -> Result<ControlFileData> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing control file {}", path.display());
// Import it as ControlFile
timeline.put_page_image(
RelishTag::ControlFile,
0,
lsn,
Bytes::copy_from_slice(&buffer[..]),
)?;
// Extract the checkpoint record and import it separately.
let pg_control = ControlFileData::decode(&buffer)?;
let checkpoint_bytes = pg_control.checkPointCopy.encode();
timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, checkpoint_bytes)?;
Ok(pg_control)
}
///
/// Import an SLRU segment file
///
fn import_slru_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
slru: SlruKind,
path: &Path,
) -> Result<()> {
// Does it look like an SLRU file?
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let segno = u32::from_str_radix(path.file_name().unwrap().to_str().unwrap(), 16)?;
trace!("importing slru file {}", path.display());
let mut rpageno = 0;
loop {
let r = file.read_exact(&mut buf);
match r {
Ok(_) => {
timeline.put_page_image(
RelishTag::Slru { slru, segno },
rpageno,
lsn,
Bytes::copy_from_slice(&buf),
)?;
}
// TODO: UnexpectedEof is expected
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
break;
}
_ => {
bail!("error reading file {}: {:#}", path.display(), err);
}
},
};
rpageno += 1;
// TODO: Check that the file isn't unexpectedly large, not larger than SLRU_PAGES_PER_SEGMENT pages
}
Ok(())
}
/// Scan PostgreSQL WAL files in given directory and load all records between
/// 'startpoint' and 'endpoint' into the repository.
fn import_wal(
walpath: &Path,
writer: &dyn TimelineWriter,
startpoint: Lsn,
endpoint: Lsn,
) -> Result<()> {
let mut waldecoder = WalStreamDecoder::new(startpoint);
let mut segno = startpoint.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let mut offset = startpoint.segment_offset(pg_constants::WAL_SEGMENT_SIZE);
let mut last_lsn = startpoint;
let mut walingest = WalIngest::new(writer.deref(), startpoint)?;
while last_lsn <= endpoint {
// FIXME: assume postgresql tli 1 for now
let filename = XLogFileName(1, segno, pg_constants::WAL_SEGMENT_SIZE);
let mut buf = Vec::new();
// Read local file
let mut path = walpath.join(&filename);
// It could be as .partial
if !PathBuf::from(&path).exists() {
path = walpath.join(filename + ".partial");
}
// Slurp the WAL file
let mut file = File::open(&path)?;
if offset > 0 {
file.seek(SeekFrom::Start(offset as u64))?;
}
let nread = file.read_to_end(&mut buf)?;
if nread != pg_constants::WAL_SEGMENT_SIZE - offset as usize {
// Maybe allow this for .partial files?
error!("read only {} bytes from WAL file", nread);
}
waldecoder.feed_bytes(&buf);
let mut nrecords = 0;
while last_lsn <= endpoint {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
walingest.ingest_record(writer, recdata, lsn)?;
last_lsn = lsn;
nrecords += 1;
trace!("imported record at {} (end {})", lsn, endpoint);
}
}
debug!("imported {} records up to {}", nrecords, last_lsn);
segno += 1;
offset = 0;
}
if last_lsn != startpoint {
debug!("reached end of WAL at {}", last_lsn);
} else {
info!("no WAL to import at {}", last_lsn);
}
Ok(())
}

File diff suppressed because it is too large Load Diff

View File

@@ -1,12 +1,56 @@
# Overview
The on-disk format is based on immutable files. The page server
receives a stream of incoming WAL, parses the WAL records to determine
which pages they apply to, and accumulates the incoming changes in
memory. Every now and then, the accumulated changes are written out to
new immutable files. This process is called checkpointing. Old versions
of on-disk files that are not needed by any timeline are removed by GC
process.
The on-disk format is based on immutable files. The page server receives a
stream of incoming WAL, parses the WAL records to determine which pages they
apply to, and accumulates the incoming changes in memory. Every now and then,
the accumulated changes are written out to new immutable files. This process is
called checkpointing. Old versions of on-disk files that are not needed by any
timeline are removed by GC process.
The main responsibility of the Page Server is to process the incoming WAL, and
reprocess it into a format that allows reasonably quick access to any page
version.
The incoming WAL contains updates to arbitrary pages in the system. The
distribution depends on the workload: the updates could be totally random, or
there could be a long stream of updates to a single relation when data is bulk
loaded, for example, or something in between. The page server slices the
incoming WAL per relation and page, and packages the sliced WAL into
suitably-sized "layer files". The layer files contain all the history of the
database, back to some reasonable retention period. This system replaces the
base backups and the WAL archive used in a traditional PostgreSQL
installation. The layer files are immutable, they are not modified in-place
after creation. New layer files are created for new incoming WAL, and old layer
files are removed when they are no longer needed. We could also replace layer
files with new files that contain the same information, merging small files for
example, but that hasn't been implemented yet.
Cloud Storage Page Server Safekeeper
Local disk Memory WAL
|AAAA| |AAAA|AAAA| |AA
|BBBB| |BBBB|BBBB| |
|CCCC|CCCC| <---- |CCCC|CCCC|CCCC| <--- |CC <---- ADEBAABED
|DDDD|DDDD| |DDDD|DDDD| |DDD
|EEEE| |EEEE|EEEE|EEEE| |E
In this illustration, WAL is received as a stream from the Safekeeper, from the
right. It is immediately captured by the page server and stored quickly in
memory. The page server memory can be thought of as a quick "reorder buffer",
used to hold the incoming WAL and reorder it so that we keep the WAL records for
the same page and relation close to each other.
From the page server memory, whenever enough WAL has been accumulated for one
relation segment, it is moved to local disk, as a new layer file, and the memory
is released.
From the local disk, the layers are further copied to Cloud Storage, for
long-term archival. After a layer has been copied to Cloud Storage, it can be
removed from local disk, although we currently keep everything locally for fast
access. If a layer is needed that isn't found locally, it is fetched from Cloud
Storage and stored in local disk.
# Terms used in layered repository
@@ -14,30 +58,9 @@ process.
- Segment - one slice of a Relish that is stored in a LayeredTimeline.
- Layer - specific version of a relish Segment in a range of LSNs.
Layers can be InMemory or OnDisk:
- InMemory layer is not durably stored and needs to rebuild from WAL on pageserver start.
- OnDisk layer is durably stored.
# Layer map
OnDisk layers can be Image or Delta:
- ImageLayer represents an image or a snapshot of a segment at one particular LSN.
- DeltaLayer represents a collection of WAL records or page images in a range of LSNs.
Dropped segments are always represented on disk by DeltaLayer.
LSN range defined by start_lsn and end_lsn:
- start_lsn is always inclusive.
- end_lsn depends on layer kind:
- InMemoryLayer is either unbounded (end_lsn = MAX_LSN) or dropped (end_lsn = drop_lsn)
- ImageLayer represents snapshot at one LSN, so end_lsn = lsn.
- DeltaLayer has explicit end_lsn, which represents end of incremental layer.
Layers can be open or historical:
- Open layer is a writeable one. Only InMemory layer can be open.
FIXME: If open layer is dropped, it is not writeable, so it should be turned into historical,
but now it is not implemented - see bug #569.
- Historical layer is the one that cannot be modified anymore. Now only OnDisk layers can be historical.
- LayerMap - a map that tracks what layers exist for all the relishes in a timeline.
The LayerMap tracks what layers exist for all the relishes in a timeline.
LayerMap consists of two data structures:
- segs - All the layers keyed by segment tag
@@ -52,8 +75,55 @@ TODO: Are there any exceptions to this?
For example, timeline.list_rels(lsn) will return all segments that are visible in this timeline at the LSN,
including ones that were not modified in this timeline and thus don't have a layer in the timeline's LayerMap.
TODO:
Describe GC and checkpoint interval settings.
# Different kinds of layers
A layer can be in different states:
- Open - a layer where new WAL records can be appended to.
- Closed - a layer that is read-only, no new WAL records can be appended to it
- Historic: synonym for closed
- InMemory: A layer that needs to be rebuilt from WAL on pageserver start.
To avoid OOM errors, InMemory layers can be spilled to disk into ephemeral file.
- OnDisk: A layer that is stored on disk. If its end-LSN is older than
disk_consistent_lsn, it is known to be fully flushed and fsync'd to local disk.
- Frozen layer: an in-memory layer that is Closed.
TODO: Clarify the difference between Closed, Historic and Frozen.
There are two kinds of OnDisk layers:
- ImageLayer represents an image or a snapshot of a 10 MB relish segment, at one particular LSN.
- DeltaLayer represents a collection of WAL records or page images in a range of LSNs, for one
relish segment.
Dropped segments are always represented on disk by DeltaLayer.
# Layer life cycle
LSN range defined by start_lsn and end_lsn:
- start_lsn is inclusive.
- end_lsn is exclusive.
For an open in-memory layer, the end_lsn is MAX_LSN. For a frozen in-memory
layer or a delta layer, it is a valid end bound. An image layer represents
snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
Every layer starts its life as an Open In-Memory layer. When the page server
receives the first WAL record for a segment, it creates a new In-Memory layer
for it, and puts it to the layer map. Later, the layer is old enough, its
contents are written to disk, as On-Disk layers. This process is called
"evicting" a layer.
Layer eviction is a two-step process: First, the layer is marked as closed, so
that it no longer accepts new WAL records, and the layer map is updated
accordingly. If a new WAL record for that segment arrives after this step, a new
Open layer is created to hold it. After this first step, the layer is a Closed
InMemory state. This first step is called "freezing" the layer.
In the second step, new Delta and Image layers are created, containing all the
data in the Frozen InMemory layer. When the new layers are ready, the original
frozen layer is replaced with the new layers in the layer map, and the original
frozen layer is dropped, releasing the memory.
# Layer files (On-disk layers)
@@ -364,6 +434,8 @@ is a newer layer file there. TODO: This optimization hasn't been
implemented! The GC algorithm will currently keep the file on the
'main' branch anyway, for as long as the child branch exists.
TODO:
Describe GC and checkpoint interval settings.
# TODO: On LSN ranges

View File

@@ -1,45 +0,0 @@
use std::{fs::File, io::Write};
use anyhow::Result;
use bookfile::{BookWriter, BoundedReader, ChapterId, ChapterWriter};
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
pub struct BlobRange {
offset: u64,
size: usize,
}
pub fn read_blob(reader: &BoundedReader<&'_ File>, range: &BlobRange) -> Result<Vec<u8>> {
let mut buf = vec![0u8; range.size];
reader.read_exact_at(&mut buf, range.offset)?;
Ok(buf)
}
pub struct BlobWriter<W> {
writer: ChapterWriter<W>,
offset: u64,
}
impl<W: Write> BlobWriter<W> {
// This function takes a BookWriter and creates a new chapter to ensure offset is 0.
pub fn new(book_writer: BookWriter<W>, chapter_id: impl Into<ChapterId>) -> Self {
let writer = book_writer.new_chapter(chapter_id);
Self { writer, offset: 0 }
}
pub fn write_blob(&mut self, blob: &[u8]) -> Result<BlobRange> {
self.writer.write_all(blob)?;
let range = BlobRange {
offset: self.offset,
size: blob.len(),
};
self.offset += blob.len() as u64;
Ok(range)
}
pub fn close(self) -> bookfile::Result<BookWriter<W>> {
self.writer.close()
}
}

View File

@@ -11,7 +11,7 @@
//! can happen when you create a new branch in the middle of a delta layer, and the WAL
//! records on the new branch are put in a new delta layer.
//!
//! When a delta file needs to be accessed, we slurp the metadata and relsize chapters
//! When a delta file needs to be accessed, we slurp the metadata and segsize chapters
//! into memory, into the DeltaLayerInner struct. See load() and unload() functions.
//! To access a page/WAL record, we search `page_version_metas` for the block # and LSN.
//! The byte ranges in the metadata can be used to find the page/WAL record in
@@ -35,39 +35,36 @@
//! file contents in any way.
//!
//! A detlta file is constructed using the 'bookfile' crate. Each file consists of two
//! parts: the page versions and the relation sizes. They are stored as separate chapters.
//! parts: the page versions and the segment sizes. They are stored as separate chapters.
//!
use crate::layered_repository::blob::BlobWriter;
use crate::config::PageServerConf;
use crate::layered_repository::filename::{DeltaFileName, PathOrConf};
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentTag,
Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentBlk, SegmentTag,
RELISH_SEG_SIZE,
};
use crate::repository::WALRecord;
use crate::waldecoder;
use crate::PageServerConf;
use crate::virtual_file::VirtualFile;
use crate::walrecord;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, Result};
use bytes::Bytes;
use anyhow::{bail, ensure, Result};
use log::*;
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
use zenith_utils::vec_map::VecMap;
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::fs;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::ops::Bound::Included;
use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, MutexGuard};
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
use bookfile::{Book, BookWriter, BoundedReader, ChapterWriter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
use super::blob::{read_blob, BlobRange};
// Magic constant to identify a Zenith delta file
pub const DELTA_FILE_MAGIC: u32 = 0x5A616E01;
@@ -77,7 +74,7 @@ static PAGE_VERSION_METAS_CHAPTER: u64 = 1;
/// Page/WAL bytes - cannot be interpreted
/// without PAGE_VERSION_METAS_CHAPTER
static PAGE_VERSIONS_CHAPTER: u64 = 2;
static REL_SIZES_CHAPTER: u64 = 3;
static SEG_SIZES_CHAPTER: u64 = 3;
/// Contains the [`Summary`] struct
static SUMMARY_CHAPTER: u64 = 4;
@@ -110,9 +107,15 @@ impl From<&DeltaLayer> for Summary {
}
#[derive(Serialize, Deserialize)]
struct PageVersionMeta {
page_image_range: Option<BlobRange>,
record_range: Option<BlobRange>,
struct BlobRange {
offset: u64,
size: usize,
}
fn read_blob<F: FileExt>(reader: &BoundedReader<&'_ F>, range: &BlobRange) -> Result<Vec<u8>> {
let mut buf = vec![0u8; range.size];
reader.read_exact_at(&mut buf, range.offset)?;
Ok(buf)
}
///
@@ -139,26 +142,43 @@ pub struct DeltaLayer {
dropped: bool,
/// Predecessor layer
predecessor: Option<Arc<dyn Layer>>,
inner: Mutex<DeltaLayerInner>,
}
pub struct DeltaLayerInner {
/// If false, the 'page_version_metas' and 'relsizes' have not been
/// If false, the 'page_version_metas' and 'seg_sizes' have not been
/// loaded into memory yet.
loaded: bool,
book: Option<Book<VirtualFile>>,
/// All versions of all pages in the file are are kept here.
/// Indexed by block number and LSN.
page_version_metas: BTreeMap<(u32, Lsn), PageVersionMeta>,
page_version_metas: VecMap<(SegmentBlk, Lsn), BlobRange>,
/// `relsizes` tracks the size of the relation at different points in time.
relsizes: BTreeMap<Lsn, u32>,
/// `seg_sizes` tracks the size of the segment at different points in time.
seg_sizes: VecMap<Lsn, SegmentBlk>,
}
impl DeltaLayerInner {
fn get_seg_size(&self, lsn: Lsn) -> Result<SegmentBlk> {
// Scan the VecMap backwards, starting from the given entry.
let slice = self
.seg_sizes
.slice_range((Included(&Lsn(0)), Included(&lsn)));
if let Some((_entry_lsn, entry)) = slice.last() {
Ok(*entry)
} else {
bail!("could not find seg size in delta layer")
}
}
}
impl Layer for DeltaLayer {
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
fn get_timeline_id(&self) -> ZTimelineId {
self.timelineid
}
@@ -180,114 +200,105 @@ impl Layer for DeltaLayer {
}
fn filename(&self) -> PathBuf {
PathBuf::from(
DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
}
.to_string(),
)
}
fn path(&self) -> Option<PathBuf> {
Some(Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
},
))
PathBuf::from(self.layer_name().to_string())
}
/// Look up given page in the cache.
fn get_page_reconstruct_data(
&self,
blknum: u32,
blknum: SegmentBlk,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
let mut need_image = true;
assert!(self.seg.blknum_in_seg(blknum));
assert!((0..RELISH_SEG_SIZE).contains(&blknum));
match &cached_img_lsn {
Some(cached_lsn) if &self.end_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Cached)
}
_ => {}
}
{
// Open the file and lock the metadata in memory
// TODO: avoid opening the file for each read
let (_path, book) = self.open_book()?;
let page_version_reader = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
let inner = self.load()?;
let page_version_reader = inner
.book
.as_ref()
.expect("should be loaded in load call above")
.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
// Scan the metadata BTreeMap backwards, starting from the given entry.
// Scan the metadata VecMap backwards, starting from the given entry.
let minkey = (blknum, Lsn(0));
let maxkey = (blknum, lsn);
let mut iter = inner
let iter = inner
.page_version_metas
.range((Included(&minkey), Included(&maxkey)));
while let Some(((_blknum, _entry_lsn), entry)) = iter.next_back() {
if let Some(img_range) = &entry.page_image_range {
// Found a page image, return it
let img = Bytes::from(read_blob(&page_version_reader, img_range)?);
reconstruct_data.page_img = Some(img);
need_image = false;
break;
} else if let Some(rec_range) = &entry.record_range {
let rec = WALRecord::des(&read_blob(&page_version_reader, rec_range)?)?;
let will_init = rec.will_init;
reconstruct_data.records.push(rec);
if will_init {
// This WAL record initializes the page, so no need to go further back
.slice_range((Included(&minkey), Included(&maxkey)))
.iter()
.rev();
for ((_blknum, pv_lsn), blob_range) in iter {
match &cached_img_lsn {
Some(cached_lsn) if pv_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Cached)
}
_ => {}
}
let pv = PageVersion::des(&read_blob(&page_version_reader, blob_range)?)?;
match pv {
PageVersion::Page(img) => {
// Found a page image, return it
reconstruct_data.page_img = Some(img);
need_image = false;
break;
}
} else {
// No base image, and no WAL record. Huh?
bail!("no page image or WAL record for requested page");
PageVersion::Wal(rec) => {
let will_init = rec.will_init();
reconstruct_data.records.push((*pv_lsn, rec));
if will_init {
// This WAL record initializes the page, so no need to go further back
need_image = false;
break;
}
}
}
}
// If we didn't find any records for this, check if the request is beyond EOF
if need_image
&& reconstruct_data.records.is_empty()
&& self.seg.rel.is_blocky()
&& blknum >= inner.get_seg_size(lsn)?
{
return Ok(PageReconstructResult::Missing(self.start_lsn));
}
// release metadata lock and close the file
}
// If an older page image is needed to reconstruct the page, let the
// caller know about the predecessor layer.
// caller know.
if need_image {
if let Some(cont_layer) = &self.predecessor {
Ok(PageReconstructResult::Continue(
self.start_lsn,
Arc::clone(cont_layer),
))
} else {
Ok(PageReconstructResult::Missing(self.start_lsn))
}
Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1)))
} else {
Ok(PageReconstructResult::Complete)
}
}
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<u32> {
fn get_seg_size(&self, lsn: Lsn) -> Result<SegmentBlk> {
assert!(lsn >= self.start_lsn);
ensure!(
self.seg.rel.is_blocky(),
"get_seg_size() called on a non-blocky rel"
);
// Scan the BTreeMap backwards, starting from the given entry.
let inner = self.load()?;
let mut iter = inner.relsizes.range((Included(&Lsn(0)), Included(&lsn)));
let result;
if let Some((_entry_lsn, entry)) = iter.next_back() {
result = *entry;
// Use the base image if needed
} else if let Some(predecessor) = &self.predecessor {
result = predecessor.get_seg_size(lsn)?;
} else {
result = 0;
}
Ok(result)
inner.get_seg_size(lsn)
}
/// Does this segment exist at given LSN?
@@ -307,17 +318,20 @@ impl Layer for DeltaLayer {
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.page_version_metas = BTreeMap::new();
inner.relsizes = BTreeMap::new();
inner.page_version_metas = VecMap::default();
inner.seg_sizes = VecMap::default();
inner.loaded = false;
// Note: we keep the Book open. Is that a good idea? The virtual file
// machinery has its own rules for closing the file descriptor if it's not
// needed, but the Book struct uses up some memory, too.
Ok(())
}
fn delete(&self) -> Result<()> {
// delete underlying file
if let Some(path) = self.path() {
fs::remove_file(path)?;
}
fs::remove_file(self.path())?;
Ok(())
}
@@ -325,6 +339,10 @@ impl Layer for DeltaLayer {
true
}
fn is_in_memory(&self) -> bool {
false
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
println!(
@@ -332,34 +350,41 @@ impl Layer for DeltaLayer {
self.tenantid, self.timelineid, self.seg, self.start_lsn, self.end_lsn
);
println!("--- relsizes ---");
println!("--- seg sizes ---");
let inner = self.load()?;
for (k, v) in inner.relsizes.iter() {
for (k, v) in inner.seg_sizes.as_slice() {
println!(" {}: {}", k, v);
}
println!("--- page versions ---");
let (_path, book) = self.open_book()?;
let path = self.path();
let file = std::fs::File::open(&path)?;
let book = Book::new(file)?;
let chapter = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
for (k, v) in inner.page_version_metas.iter() {
for ((blk, lsn), blob_range) in inner.page_version_metas.as_slice() {
let mut desc = String::new();
if let Some(page_image_range) = v.page_image_range.as_ref() {
let image = read_blob(&chapter, page_image_range)?;
write!(&mut desc, " img {} bytes", image.len())?;
let buf = read_blob(&chapter, blob_range)?;
let pv = PageVersion::des(&buf)?;
match pv {
PageVersion::Page(img) => {
write!(&mut desc, " img {} bytes", img.len())?;
}
PageVersion::Wal(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
blob_range.size,
rec.will_init(),
wal_desc
)?;
}
}
if let Some(record_range) = v.record_range.as_ref() {
let record_bytes = read_blob(&chapter, record_range)?;
let rec = WALRecord::des(&record_bytes)?;
let wal_desc = waldecoder::describe_wal_record(&rec.rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
rec.rec.len(),
rec.will_init,
wal_desc
)?;
}
println!(" blk {} at {}: {}", k.0, k.1, desc);
println!(" blk {} at {}: {}", blk, lsn, desc);
}
Ok(())
@@ -381,137 +406,6 @@ impl DeltaLayer {
}
}
/// Create a new delta file, using the given btreemaps containing the page versions and
/// relsizes.
///
/// This is used to write the in-memory layer to disk. The in-memory layer uses the same
/// data structure with two btreemaps as we do, so passing the btreemaps is currently
/// expedient.
#[allow(clippy::too_many_arguments)]
pub fn create(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
predecessor: Option<Arc<dyn Layer>>,
page_versions: BTreeMap<(u32, Lsn), PageVersion>,
relsizes: BTreeMap<Lsn, u32>,
) -> Result<DeltaLayer> {
let delta_layer = DeltaLayer {
path_or_conf: PathOrConf::Conf(conf),
timelineid,
tenantid,
seg,
start_lsn,
end_lsn,
dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: true,
page_version_metas: BTreeMap::new(),
relsizes,
}),
predecessor,
};
let mut inner = delta_layer.inner.lock().unwrap();
// Write the in-memory btreemaps into a file
let path = delta_layer
.path()
.expect("DeltaLayer is supposed to have a layer path on disk");
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, DELTA_FILE_MAGIC)?;
let mut page_version_writer = BlobWriter::new(book, PAGE_VERSIONS_CHAPTER);
for (key, page_version) in page_versions {
let page_image_range = page_version
.page_image
.map(|page_image| page_version_writer.write_blob(page_image.as_ref()))
.transpose()?;
let record_range = page_version
.record
.map(|record| {
let buf = WALRecord::ser(&record)?;
page_version_writer.write_blob(&buf)
})
.transpose()?;
let old = inner.page_version_metas.insert(
key,
PageVersionMeta {
page_image_range,
record_range,
},
);
assert!(old.is_none());
}
let book = page_version_writer.close()?;
// Write out page versions
let mut chapter = book.new_chapter(PAGE_VERSION_METAS_CHAPTER);
let buf = BTreeMap::ser(&inner.page_version_metas)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
// and relsizes to separate chapter
let mut chapter = book.new_chapter(REL_SIZES_CHAPTER);
let buf = BTreeMap::ser(&inner.relsizes)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid,
timelineid,
seg,
start_lsn,
end_lsn,
dropped,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
book.close()?;
trace!("saved {}", &path.display());
drop(inner);
Ok(delta_layer)
}
fn open_book(&self) -> Result<(PathBuf, Book<File>)> {
let path = Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
},
);
let file = File::open(&path)?;
let book = Book::new(file)?;
Ok((path, book))
}
///
/// Load the contents of the file into memory
///
@@ -523,7 +417,14 @@ impl DeltaLayer {
return Ok(inner);
}
let (path, book) = self.open_book()?;
let path = self.path();
// Open the file if it's not open already.
if inner.book.is_none() {
let file = VirtualFile::open(&path)?;
inner.book = Some(Book::new(file)?);
}
let book = inner.book.as_ref().unwrap();
match &self.path_or_conf {
PathOrConf::Conf(_) => {
@@ -551,18 +452,16 @@ impl DeltaLayer {
}
let chapter = book.read_chapter(PAGE_VERSION_METAS_CHAPTER)?;
let page_version_metas = BTreeMap::des(&chapter)?;
let page_version_metas = VecMap::des(&chapter)?;
let chapter = book.read_chapter(REL_SIZES_CHAPTER)?;
let relsizes = BTreeMap::des(&chapter)?;
let chapter = book.read_chapter(SEG_SIZES_CHAPTER)?;
let seg_sizes = VecMap::des(&chapter)?;
debug!("loaded from {}", &path.display());
*inner = DeltaLayerInner {
loaded: true,
page_version_metas,
relsizes,
};
inner.page_version_metas = page_version_metas;
inner.seg_sizes = seg_sizes;
inner.loaded = true;
Ok(inner)
}
@@ -573,7 +472,6 @@ impl DeltaLayer {
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &DeltaFileName,
predecessor: Option<Arc<dyn Layer>>,
) -> DeltaLayer {
DeltaLayer {
path_or_conf: PathOrConf::Conf(conf),
@@ -585,17 +483,20 @@ impl DeltaLayer {
dropped: filename.dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: false,
page_version_metas: BTreeMap::new(),
relsizes: BTreeMap::new(),
book: None,
page_version_metas: VecMap::default(),
seg_sizes: VecMap::default(),
}),
predecessor,
}
}
/// Create a DeltaLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'dump_layerfile' binary.
pub fn new_for_path(path: &Path, book: &Book<File>) -> Result<Self> {
pub fn new_for_path<F>(path: &Path, book: &Book<F>) -> Result<Self>
where
F: std::os::unix::prelude::FileExt,
{
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let summary = Summary::des(&chapter)?;
@@ -609,10 +510,196 @@ impl DeltaLayer {
dropped: summary.dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: false,
page_version_metas: BTreeMap::new(),
relsizes: BTreeMap::new(),
book: None,
page_version_metas: VecMap::default(),
seg_sizes: VecMap::default(),
}),
predecessor: None,
})
}
fn layer_name(&self) -> DeltaFileName {
DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
}
}
/// Path to the layer file in pageserver workdir.
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&self.layer_name(),
)
}
}
/// A builder object for constructing a new delta layer.
///
/// Usage:
///
/// 1. Create the DeltaLayerWriter by calling DeltaLayerWriter::new(...)
///
/// 2. Write the contents by calling `put_page_version` for every page
/// version to store in the layer.
///
/// 3. Call `finish`.
///
pub struct DeltaLayerWriter {
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
page_version_writer: ChapterWriter<BufWriter<VirtualFile>>,
pv_offset: u64,
page_version_metas: VecMap<(SegmentBlk, Lsn), BlobRange>,
}
impl DeltaLayerWriter {
///
/// Start building a new delta layer.
///
pub fn new(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
) -> Result<DeltaLayerWriter> {
// Create the file
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = DeltaLayer::path_for(
&PathOrConf::Conf(conf),
timelineid,
tenantid,
&DeltaFileName {
seg,
start_lsn,
end_lsn,
dropped,
},
);
let file = VirtualFile::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, DELTA_FILE_MAGIC)?;
// Open the page-versions chapter for writing. The calls to
// `put_page_version` will use this to write the contents.
let page_version_writer = book.new_chapter(PAGE_VERSIONS_CHAPTER);
Ok(DeltaLayerWriter {
conf,
timelineid,
tenantid,
seg,
start_lsn,
end_lsn,
dropped,
page_version_writer,
page_version_metas: VecMap::default(),
pv_offset: 0,
})
}
///
/// Append a page version to the file.
///
/// 'buf' is a serialized PageVersion.
/// The page versions must be appended in blknum, lsn order.
///
pub fn put_page_version(&mut self, blknum: SegmentBlk, lsn: Lsn, buf: &[u8]) -> Result<()> {
// Remember the offset and size metadata. The metadata is written
// to a separate chapter, in `finish`.
let blob_range = BlobRange {
offset: self.pv_offset,
size: buf.len(),
};
self.page_version_metas
.append((blknum, lsn), blob_range)
.unwrap();
// write the page version
self.page_version_writer.write_all(buf)?;
self.pv_offset += buf.len() as u64;
Ok(())
}
///
/// Finish writing the delta layer.
///
/// 'seg_sizes' is a list of size changes to store with the actual data.
///
pub fn finish(self, seg_sizes: VecMap<Lsn, SegmentBlk>) -> Result<DeltaLayer> {
// Close the page-versions chapter
let book = self.page_version_writer.close()?;
// Write out page versions metadata
let mut chapter = book.new_chapter(PAGE_VERSION_METAS_CHAPTER);
let buf = VecMap::ser(&self.page_version_metas)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
if self.seg.rel.is_blocky() {
assert!(!seg_sizes.is_empty());
}
// and seg_sizes to separate chapter
let mut chapter = book.new_chapter(SEG_SIZES_CHAPTER);
let buf = VecMap::ser(&seg_sizes)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid: self.tenantid,
timelineid: self.timelineid,
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
book.close()?;
// Note: Because we opened the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.book here. The first read will have to re-open it.
let layer = DeltaLayer {
path_or_conf: PathOrConf::Conf(self.conf),
tenantid: self.tenantid,
timelineid: self.timelineid,
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: false,
book: None,
page_version_metas: VecMap::default(),
seg_sizes: VecMap::default(),
}),
};
trace!("created delta layer {}", &layer.path().display());
Ok(layer)
}
}

View File

@@ -0,0 +1,310 @@
//! Implementation of append-only file data structure
//! used to keep in-memory layers spilled on disk.
use crate::config::PageServerConf;
use crate::page_cache;
use crate::page_cache::PAGE_SZ;
use crate::page_cache::{ReadBufResult, WriteBufResult};
use crate::virtual_file::VirtualFile;
use lazy_static::lazy_static;
use std::cmp::min;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{Error, ErrorKind, Seek, SeekFrom, Write};
use std::ops::DerefMut;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use std::os::unix::fs::FileExt;
lazy_static! {
///
/// This is the global cache of file descriptors (File objects).
///
static ref EPHEMERAL_FILES: RwLock<EphemeralFiles> = RwLock::new(EphemeralFiles {
next_file_id: 1,
files: HashMap::new(),
});
}
pub struct EphemeralFiles {
next_file_id: u64,
files: HashMap<u64, Arc<VirtualFile>>,
}
pub struct EphemeralFile {
file_id: u64,
_tenantid: ZTenantId,
_timelineid: ZTimelineId,
file: Arc<VirtualFile>,
pos: u64,
}
impl EphemeralFile {
pub fn create(
conf: &PageServerConf,
tenantid: ZTenantId,
timelineid: ZTimelineId,
) -> Result<EphemeralFile, std::io::Error> {
let mut l = EPHEMERAL_FILES.write().unwrap();
let file_id = l.next_file_id;
l.next_file_id += 1;
let filename = conf
.timeline_path(&timelineid, &tenantid)
.join(PathBuf::from(format!("ephemeral-{}", file_id)));
let file = VirtualFile::open_with_options(
&filename,
OpenOptions::new().read(true).write(true).create(true),
)?;
let file_rc = Arc::new(file);
l.files.insert(file_id, file_rc.clone());
Ok(EphemeralFile {
file_id,
_tenantid: tenantid,
_timelineid: timelineid,
file: file_rc,
pos: 0,
})
}
pub fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
.file
.read_at(&mut buf[off..], blkno as u64 * PAGE_SZ as u64 + off as u64)?;
if n == 0 {
// Reached EOF. Fill the rest of the buffer with zeros.
const ZERO_BUF: [u8; PAGE_SZ] = [0u8; PAGE_SZ];
buf[off..].copy_from_slice(&ZERO_BUF[off..]);
break;
}
off += n as usize;
}
Ok(())
}
}
/// Does the given filename look like an ephemeral file?
pub fn is_ephemeral_file(filename: &str) -> bool {
if let Some(rest) = filename.strip_prefix("ephemeral-") {
rest.parse::<u32>().is_ok()
} else {
false
}
}
impl FileExt for EphemeralFile {
fn read_at(&self, dstbuf: &mut [u8], offset: u64) -> Result<usize, Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, dstbuf.len());
let read_guard;
let mut write_guard;
let cache = page_cache::get();
let buf = match cache.read_ephemeral_buf(self.file_id, blkno) {
ReadBufResult::Found(guard) => {
read_guard = guard;
read_guard.as_ref()
}
ReadBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to read the requested slice from the
// buffer.
write_guard.as_ref()
}
};
dstbuf[0..len].copy_from_slice(&buf[off..(off + len)]);
Ok(len)
}
fn write_at(&self, srcbuf: &[u8], offset: u64) -> Result<usize, Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, srcbuf.len());
let mut write_guard;
let cache = page_cache::get();
let buf = match cache.write_ephemeral_buf(self.file_id, blkno) {
WriteBufResult::Found(guard) => {
write_guard = guard;
write_guard.deref_mut()
}
WriteBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to modify it.
write_guard.deref_mut()
}
};
buf[off..(off + len)].copy_from_slice(&srcbuf[0..len]);
write_guard.mark_dirty();
Ok(len)
}
}
impl Write for EphemeralFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, Error> {
let n = self.write_at(buf, self.pos)?;
self.pos += n as u64;
Ok(n)
}
fn flush(&mut self) -> Result<(), std::io::Error> {
// we don't need to flush data:
// * we either write input bytes or not, not keeping any intermediate data buffered
// * rust unix file `flush` impl does not flush things either, returning `Ok(())`
Ok(())
}
}
impl Seek for EphemeralFile {
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match pos {
SeekFrom::Start(offset) => {
self.pos = offset;
}
SeekFrom::End(_offset) => {
return Err(Error::new(
ErrorKind::Other,
"SeekFrom::End not supported by EphemeralFile",
));
}
SeekFrom::Current(offset) => {
let pos = self.pos as i128 + offset as i128;
if pos < 0 {
return Err(Error::new(
ErrorKind::InvalidInput,
"offset would be negative",
));
}
if pos > u64::MAX as i128 {
return Err(Error::new(ErrorKind::InvalidInput, "offset overflow"));
}
self.pos = pos as u64;
}
}
Ok(self.pos)
}
}
impl Drop for EphemeralFile {
fn drop(&mut self) {
// drop all pages from page cache
let cache = page_cache::get();
cache.drop_buffers_for_ephemeral(self.file_id);
// remove entry from the hash map
EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id);
// unlink file
// FIXME: print error
let _ = std::fs::remove_file(&self.file.path);
}
}
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Error> {
if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) {
file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64)?;
Ok(())
} else {
Err(std::io::Error::new(
ErrorKind::Other,
"could not write back page, not found in ephemeral files hash",
))
}
}
#[cfg(test)]
mod tests {
use super::*;
use rand::seq::SliceRandom;
use rand::thread_rng;
use std::fs;
use std::str::FromStr;
fn repo_harness(
test_name: &str,
) -> Result<(&'static PageServerConf, ZTenantId, ZTimelineId), Error> {
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
let conf = PageServerConf::dummy_conf(repo_dir);
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenantid = ZTenantId::from_str("11000000000000000000000000000000").unwrap();
let timelineid = ZTimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&timelineid, &tenantid))?;
Ok((conf, tenantid, timelineid))
}
// Helper function to slurp contents of a file, starting at the current position,
// into a string
fn read_string(efile: &EphemeralFile, offset: u64, len: usize) -> Result<String, Error> {
let mut buf = Vec::new();
buf.resize(len, 0u8);
efile.read_exact_at(&mut buf, offset)?;
Ok(String::from_utf8_lossy(&buf)
.trim_end_matches('\0')
.to_string())
}
#[test]
fn test_ephemeral_files() -> Result<(), Error> {
let (conf, tenantid, timelineid) = repo_harness("ephemeral_files")?;
let mut file_a = EphemeralFile::create(conf, tenantid, timelineid)?;
file_a.write_all(b"foo")?;
assert_eq!("foo", read_string(&file_a, 0, 20)?);
file_a.write_all(b"bar")?;
assert_eq!("foobar", read_string(&file_a, 0, 20)?);
// Open a lot of files, enough to cause some page evictions.
let mut efiles = Vec::new();
for fileno in 0..100 {
let mut efile = EphemeralFile::create(conf, tenantid, timelineid)?;
efile.write_all(format!("file {}", fileno).as_bytes())?;
assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?);
efiles.push((fileno, efile));
}
// Check that all the files can still be read from. Use them in random order for
// good measure.
efiles.as_mut_slice().shuffle(&mut thread_rng());
for (fileno, efile) in efiles.iter_mut() {
assert_eq!(format!("file {}", fileno), read_string(efile, 0, 10)?);
}
Ok(())
}
}

View File

@@ -1,16 +1,12 @@
//!
//! Helper functions for dealing with filenames of the image and delta layer files.
//!
use crate::config::PageServerConf;
use crate::layered_repository::storage_layer::SegmentTag;
use crate::relish::*;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use std::fmt;
use std::fs;
use std::path::PathBuf;
use anyhow::Result;
use log::*;
use zenith_utils::lsn::Lsn;
// Note: LayeredTimeline::load_layer_map() relies on this sort order
@@ -35,7 +31,7 @@ impl DeltaFileName {
/// Parse a string as a delta file name. Returns None if the filename does not
/// match the expected pattern.
///
pub fn from_str(fname: &str) -> Option<Self> {
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
@@ -168,7 +164,7 @@ impl ImageFileName {
/// Parse a string as an image file name. Returns None if the filename does not
/// match the expected pattern.
///
pub fn from_str(fname: &str) -> Option<Self> {
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
@@ -269,36 +265,6 @@ impl fmt::Display for ImageFileName {
}
}
/// Scan timeline directory and create ImageFileName and DeltaFilename
/// structs representing all files on disk
///
/// TODO: returning an Iterator would be more idiomatic
pub fn list_files(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> Result<(Vec<ImageFileName>, Vec<DeltaFileName>)> {
let path = conf.timeline_path(&timelineid, &tenantid);
let mut deltafiles: Vec<DeltaFileName> = Vec::new();
let mut imgfiles: Vec<ImageFileName> = Vec::new();
for direntry in fs::read_dir(path)? {
let fname = direntry?.file_name();
let fname = fname.to_str().unwrap();
if let Some(deltafilename) = DeltaFileName::from_str(fname) {
deltafiles.push(deltafilename);
} else if let Some(imgfilename) = ImageFileName::from_str(fname) {
imgfiles.push(imgfilename);
} else if fname == "wal" || fname == "metadata" || fname == "ancestor" {
// ignore these
} else {
warn!("unrecognized filename in timeline dir: {}", fname);
}
}
Ok((imgfiles, deltafiles))
}
/// Helper enum to hold a PageServerConf, or a path
///
/// This is used by DeltaLayer and ImageLayer. Normally, this holds a reference to the

View File

@@ -0,0 +1,142 @@
//!
//! Global registry of open layers.
//!
//! Whenever a new in-memory layer is created to hold incoming WAL, it is registered
//! in [`GLOBAL_LAYER_MAP`], so that we can keep track of the total number of
//! in-memory layers in the system, and know when we need to evict some to release
//! memory.
//!
//! Each layer is assigned a unique ID when it's registered in the global registry.
//! The ID can be used to relocate the layer later, without having to hold locks.
//!
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::{Arc, RwLock};
use super::inmemory_layer::InMemoryLayer;
use lazy_static::lazy_static;
const MAX_USAGE_COUNT: u8 = 5;
lazy_static! {
pub static ref GLOBAL_LAYER_MAP: RwLock<InMemoryLayers> =
RwLock::new(InMemoryLayers::default());
}
// TODO these types can probably be smaller
#[derive(PartialEq, Eq, Clone, Copy)]
pub struct LayerId {
index: usize,
tag: u64, // to avoid ABA problem
}
enum SlotData {
Occupied(Arc<InMemoryLayer>),
/// Vacant slots form a linked list, the value is the index
/// of the next vacant slot in the list.
Vacant(Option<usize>),
}
struct Slot {
tag: u64,
data: SlotData,
usage_count: AtomicU8, // for clock algorithm
}
#[derive(Default)]
pub struct InMemoryLayers {
slots: Vec<Slot>,
num_occupied: usize,
// Head of free-slot list.
next_empty_slot_idx: Option<usize>,
}
impl InMemoryLayers {
pub fn insert(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
let slot_idx = match self.next_empty_slot_idx {
Some(slot_idx) => slot_idx,
None => {
let idx = self.slots.len();
self.slots.push(Slot {
tag: 0,
data: SlotData::Vacant(None),
usage_count: AtomicU8::new(0),
});
idx
}
};
let slots_len = self.slots.len();
let slot = &mut self.slots[slot_idx];
match slot.data {
SlotData::Occupied(_) => {
panic!("an occupied slot was in the free list");
}
SlotData::Vacant(next_empty_slot_idx) => {
self.next_empty_slot_idx = next_empty_slot_idx;
}
}
slot.data = SlotData::Occupied(layer);
slot.usage_count.store(1, Ordering::Relaxed);
self.num_occupied += 1;
assert!(self.num_occupied <= slots_len);
LayerId {
index: slot_idx,
tag: slot.tag,
}
}
pub fn get(&self, layer_id: &LayerId) -> Option<Arc<InMemoryLayer>> {
let slot = self.slots.get(layer_id.index)?; // TODO should out of bounds indexes just panic?
if slot.tag != layer_id.tag {
return None;
}
if let SlotData::Occupied(layer) = &slot.data {
let _ = slot.usage_count.fetch_update(
Ordering::Relaxed,
Ordering::Relaxed,
|old_usage_count| {
if old_usage_count < MAX_USAGE_COUNT {
Some(old_usage_count + 1)
} else {
None
}
},
);
Some(Arc::clone(layer))
} else {
None
}
}
// TODO this won't be a public API in the future
pub fn remove(&mut self, layer_id: &LayerId) {
let slot = &mut self.slots[layer_id.index];
if slot.tag != layer_id.tag {
return;
}
match &slot.data {
SlotData::Occupied(_layer) => {
// TODO evict the layer
}
SlotData::Vacant(_) => unimplemented!(),
}
slot.data = SlotData::Vacant(self.next_empty_slot_idx);
self.next_empty_slot_idx = Some(layer_id.index);
assert!(self.num_occupied > 0);
self.num_occupied -= 1;
slot.tag = slot.tag.wrapping_add(1);
}
}

View File

@@ -21,26 +21,25 @@
//!
//! For non-blocky relishes, the image can be found in NONBLOCKY_IMAGE_CHAPTER.
//!
use crate::config::PageServerConf;
use crate::layered_repository::filename::{ImageFileName, PathOrConf};
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, SegmentTag,
Layer, PageReconstructData, PageReconstructResult, SegmentBlk, SegmentTag,
};
use crate::layered_repository::LayeredTimeline;
use crate::layered_repository::RELISH_SEG_SIZE;
use crate::PageServerConf;
use crate::virtual_file::VirtualFile;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{anyhow, bail, ensure, Result};
use anyhow::{anyhow, bail, ensure, Context, Result};
use bytes::Bytes;
use log::*;
use serde::{Deserialize, Serialize};
use std::convert::TryInto;
use std::fs;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
use bookfile::{Book, BookWriter, ChapterWriter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
@@ -99,14 +98,13 @@ pub struct ImageLayer {
#[derive(Clone)]
enum ImageType {
Blocky { num_blocks: u32 },
Blocky { num_blocks: SegmentBlk },
NonBlocky,
}
pub struct ImageLayerInner {
/// If false, the 'image_type' has not been
/// loaded into memory yet.
loaded: bool,
/// If None, the 'image_type' has not been loaded into memory yet.
book: Option<Book<VirtualFile>>,
/// Derived from filename and bookfile chapter metadata
image_type: ImageType,
@@ -114,25 +112,11 @@ pub struct ImageLayerInner {
impl Layer for ImageLayer {
fn filename(&self) -> PathBuf {
PathBuf::from(
ImageFileName {
seg: self.seg,
lsn: self.lsn,
}
.to_string(),
)
PathBuf::from(self.layer_name().to_string())
}
fn path(&self) -> Option<PathBuf> {
Some(Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
))
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
fn get_timeline_id(&self) -> ZTimelineId {
@@ -152,41 +136,62 @@ impl Layer for ImageLayer {
}
fn get_end_lsn(&self) -> Lsn {
self.lsn
// End-bound is exclusive
self.lsn + 1
}
/// Look up given page in the file
fn get_page_reconstruct_data(
&self,
blknum: u32,
blknum: SegmentBlk,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
assert!((0..RELISH_SEG_SIZE).contains(&blknum));
assert!(lsn >= self.lsn);
match cached_img_lsn {
Some(cached_lsn) if self.lsn <= cached_lsn => return Ok(PageReconstructResult::Cached),
_ => {}
}
let inner = self.load()?;
let base_blknum = blknum % RELISH_SEG_SIZE;
let (_path, book) = self.open_book()?;
let buf = match &inner.image_type {
ImageType::Blocky { num_blocks } => {
if base_blknum >= *num_blocks {
// Check if the request is beyond EOF
if blknum >= *num_blocks {
return Ok(PageReconstructResult::Missing(lsn));
}
let mut buf = vec![0u8; BLOCK_SIZE];
let offset = BLOCK_SIZE as u64 * base_blknum as u64;
let offset = BLOCK_SIZE as u64 * blknum as u64;
let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
chapter.read_exact_at(&mut buf, offset)?;
let chapter = inner
.book
.as_ref()
.unwrap()
.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
chapter.read_exact_at(&mut buf, offset).with_context(|| {
format!(
"failed to read page from data file {} at offset {}",
self.filename().display(),
offset
)
})?;
buf
}
ImageType::NonBlocky => {
ensure!(base_blknum == 0);
book.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?.into_vec()
ensure!(blknum == 0);
inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?
.into_vec()
}
};
@@ -195,7 +200,7 @@ impl Layer for ImageLayer {
}
/// Get size of the segment
fn get_seg_size(&self, _lsn: Lsn) -> Result<u32> {
fn get_seg_size(&self, _lsn: Lsn) -> Result<SegmentBlk> {
let inner = self.load()?;
match inner.image_type {
ImageType::Blocky { num_blocks } => Ok(num_blocks),
@@ -208,22 +213,13 @@ impl Layer for ImageLayer {
Ok(true)
}
///
/// Release most of the memory used by this layer. If it's accessed again later,
/// it will need to be loaded back.
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.image_type = ImageType::Blocky { num_blocks: 0 };
inner.loaded = false;
Ok(())
}
fn delete(&self) -> Result<()> {
// delete underlying file
if let Some(path) = self.path() {
fs::remove_file(path)?;
}
fs::remove_file(self.path())?;
Ok(())
}
@@ -231,6 +227,10 @@ impl Layer for ImageLayer {
false
}
fn is_in_memory(&self) -> bool {
false
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
println!(
@@ -243,8 +243,11 @@ impl Layer for ImageLayer {
match inner.image_type {
ImageType::Blocky { num_blocks } => println!("({}) blocks ", num_blocks),
ImageType::NonBlocky => {
let (_path, book) = self.open_book()?;
let chapter = book.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?;
let chapter = inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?;
println!("non-blocky ({} bytes)", chapter.len());
}
}
@@ -268,121 +271,6 @@ impl ImageLayer {
}
}
/// Create a new image file, using the given array of pages.
fn create(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
lsn: Lsn,
base_images: Vec<Bytes>,
) -> Result<ImageLayer> {
let image_type = if seg.rel.is_blocky() {
let num_blocks: u32 = base_images.len().try_into()?;
ImageType::Blocky { num_blocks }
} else {
assert_eq!(base_images.len(), 1);
ImageType::NonBlocky
};
let layer = ImageLayer {
path_or_conf: PathOrConf::Conf(conf),
timelineid,
tenantid,
seg,
lsn,
inner: Mutex::new(ImageLayerInner {
loaded: true,
image_type: image_type.clone(),
}),
};
let inner = layer.inner.lock().unwrap();
// Write the images into a file
let path = layer
.path()
.expect("ImageLayer is supposed to have a layer path on disk");
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?;
let book = match &image_type {
ImageType::Blocky { .. } => {
let mut chapter = book.new_chapter(BLOCKY_IMAGES_CHAPTER);
for block_bytes in base_images {
assert_eq!(block_bytes.len(), BLOCK_SIZE);
chapter.write_all(&block_bytes)?;
}
chapter.close()?
}
ImageType::NonBlocky => {
let mut chapter = book.new_chapter(NONBLOCKY_IMAGE_CHAPTER);
chapter.write_all(&base_images[0])?;
chapter.close()?
}
};
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid,
timelineid,
seg,
lsn,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
book.close()?;
trace!("saved {}", &path.display());
drop(inner);
Ok(layer)
}
// Create a new image file by materializing every page in a source layer
// at given LSN.
pub fn create_from_src(
conf: &'static PageServerConf,
timeline: &LayeredTimeline,
src: &dyn Layer,
lsn: Lsn,
) -> Result<ImageLayer> {
let seg = src.get_seg_tag();
let timelineid = timeline.timelineid;
let startblk;
let size;
if seg.rel.is_blocky() {
size = src.get_seg_size(lsn)?;
startblk = seg.segno * RELISH_SEG_SIZE;
} else {
size = 1;
startblk = 0;
}
trace!(
"creating new ImageLayer for {} on timeline {} at {}",
seg,
timelineid,
lsn,
);
let mut base_images: Vec<Bytes> = Vec::new();
for blknum in startblk..(startblk + size) {
let img = timeline.materialize_page(seg, blknum, lsn, &*src)?;
base_images.push(img);
}
Self::create(conf, timelineid, timeline.tenantid, seg, lsn, base_images)
}
///
/// Load the contents of the file into memory
///
@@ -390,11 +278,19 @@ impl ImageLayer {
// quick exit if already loaded
let mut inner = self.inner.lock().unwrap();
if inner.loaded {
if inner.book.is_some() {
return Ok(inner);
}
let (path, book) = self.open_book()?;
let path = self.path();
let file = VirtualFile::open(&path)
.with_context(|| format!("Failed to open virtual file '{}'", path.display()))?;
let book = Book::new(file).with_context(|| {
format!(
"Failed to open virtual file '{}' as a bookfile",
path.display()
)
})?;
match &self.path_or_conf {
PathOrConf::Conf(_) => {
@@ -425,7 +321,7 @@ impl ImageLayer {
let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
let images_len = chapter.len();
ensure!(images_len % BLOCK_SIZE as u64 == 0);
let num_blocks: u32 = (images_len / BLOCK_SIZE as u64).try_into()?;
let num_blocks: SegmentBlk = (images_len / BLOCK_SIZE as u64).try_into()?;
ImageType::Blocky { num_blocks }
} else {
let _chapter = book.chapter_reader(NONBLOCKY_IMAGE_CHAPTER)?;
@@ -435,30 +331,13 @@ impl ImageLayer {
debug!("loaded from {}", &path.display());
*inner = ImageLayerInner {
loaded: true,
book: Some(book),
image_type,
};
Ok(inner)
}
fn open_book(&self) -> Result<(PathBuf, Book<File>)> {
let path = Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
);
let file = File::open(&path)?;
let book = Book::new(file)?;
Ok((path, book))
}
/// Create an ImageLayer struct representing an existing file on disk
pub fn new(
conf: &'static PageServerConf,
@@ -473,7 +352,7 @@ impl ImageLayer {
seg: filename.seg,
lsn: filename.lsn,
inner: Mutex::new(ImageLayerInner {
loaded: false,
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
}),
}
@@ -482,7 +361,10 @@ impl ImageLayer {
/// Create an ImageLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'dump_layerfile' binary.
pub fn new_for_path(path: &Path, book: &Book<File>) -> Result<ImageLayer> {
pub fn new_for_path<F>(path: &Path, book: &Book<F>) -> Result<ImageLayer>
where
F: std::os::unix::prelude::FileExt,
{
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let summary = Summary::des(&chapter)?;
@@ -493,9 +375,159 @@ impl ImageLayer {
seg: summary.seg,
lsn: summary.lsn,
inner: Mutex::new(ImageLayerInner {
loaded: false,
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
}),
})
}
fn layer_name(&self) -> ImageFileName {
ImageFileName {
seg: self.seg,
lsn: self.lsn,
}
}
/// Path to the layer file in pageserver workdir.
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&self.layer_name(),
)
}
}
/// A builder object for constructing a new image layer.
///
/// Usage:
///
/// 1. Create the ImageLayerWriter by calling ImageLayerWriter::new(...)
///
/// 2. Write the contents by calling `put_page_image` for every page
/// in the segment.
///
/// 3. Call `finish`.
///
pub struct ImageLayerWriter {
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
lsn: Lsn,
num_blocks: SegmentBlk,
page_image_writer: ChapterWriter<BufWriter<VirtualFile>>,
num_blocks_written: SegmentBlk,
}
impl ImageLayerWriter {
pub fn new(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
lsn: Lsn,
num_blocks: SegmentBlk,
) -> Result<ImageLayerWriter> {
// Create the file
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = ImageLayer::path_for(
&PathOrConf::Conf(conf),
timelineid,
tenantid,
&ImageFileName { seg, lsn },
);
let file = VirtualFile::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?;
// Open the page-images chapter for writing. The calls to
// `put_page_image` will use this to write the contents.
let chapter = if seg.rel.is_blocky() {
book.new_chapter(BLOCKY_IMAGES_CHAPTER)
} else {
assert_eq!(num_blocks, 1);
book.new_chapter(NONBLOCKY_IMAGE_CHAPTER)
};
let writer = ImageLayerWriter {
conf,
timelineid,
tenantid,
seg,
lsn,
num_blocks,
page_image_writer: chapter,
num_blocks_written: 0,
};
Ok(writer)
}
///
/// Write next page image to the file.
///
/// The page versions must be appended in blknum order.
///
pub fn put_page_image(&mut self, block_bytes: &[u8]) -> Result<()> {
assert!(self.num_blocks_written < self.num_blocks);
if self.seg.rel.is_blocky() {
assert_eq!(block_bytes.len(), BLOCK_SIZE);
}
self.page_image_writer.write_all(block_bytes)?;
self.num_blocks_written += 1;
Ok(())
}
pub fn finish(self) -> Result<ImageLayer> {
// Check that the `put_page_image' was called for every block.
assert!(self.num_blocks_written == self.num_blocks);
// Close the page-images chapter
let book = self.page_image_writer.close()?;
// Write out the summary chapter
let image_type = if self.seg.rel.is_blocky() {
ImageType::Blocky {
num_blocks: self.num_blocks,
}
} else {
ImageType::NonBlocky
};
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid: self.tenantid,
timelineid: self.timelineid,
seg: self.seg,
lsn: self.lsn,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
book.close()?;
// Note: Because we open the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.book here. The first read will have to re-open it.
let layer = ImageLayer {
path_or_conf: PathOrConf::Conf(self.conf),
timelineid: self.timelineid,
tenantid: self.tenantid,
seg: self.seg,
lsn: self.lsn,
inner: Mutex::new(ImageLayerInner {
book: None,
image_type,
}),
};
trace!("created image layer {}", layer.path().display());
Ok(layer)
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,468 @@
///
/// IntervalTree is data structure for holding intervals. It is generic
/// to make unit testing possible, but the only real user of it is the layer map,
///
/// It's inspired by the "segment tree" or a "statistic tree" as described in
/// https://en.wikipedia.org/wiki/Segment_tree. However, we use a B-tree to hold
/// the points instead of a binary tree. This is called an "interval tree" instead
/// of "segment tree" because the term "segment" is already using Zenith to mean
/// something else. To add to the confusion, there is another data structure known
/// as "interval tree" out there (see https://en.wikipedia.org/wiki/Interval_tree),
/// for storing intervals, but this isn't that.
///
/// The basic idea is to have a B-tree of "interesting Points". At each Point,
/// there is a list of intervals that contain the point. The Points are formed
/// from the start bounds of each interval; there is a Point for each distinct
/// start bound.
///
/// Operations:
///
/// To find intervals that contain a given point, you search the b-tree to find
/// the nearest Point <= search key. Then you just return the list of intervals.
///
/// To insert an interval, find the Point with start key equal to the inserted item.
/// If the Point doesn't exist yet, create it, by copying all the items from the
/// previous Point that cover the new Point. Then walk right, inserting the new
/// interval to all the Points that are contained by the new interval (including the
/// newly created Point).
///
/// To remove an interval, you scan the tree for all the Points that are contained by
/// the removed interval, and remove it from the list in each Point.
///
/// Requirements and assumptions:
///
/// - Can store overlapping items
/// - But there are not many overlapping items
/// - The interval bounds don't change after it is added to the tree
/// - Intervals are uniquely identified by pointer equality. You must not be insert the
/// same interval object twice, and `remove` uses pointer equality to remove the right
/// interval. It is OK to have two intervals with the same bounds, however.
///
use std::collections::BTreeMap;
use std::fmt::Debug;
use std::ops::Range;
use std::sync::Arc;
pub struct IntervalTree<I: ?Sized>
where
I: IntervalItem,
{
points: BTreeMap<I::Key, Point<I>>,
}
struct Point<I: ?Sized> {
/// All intervals that contain this point, in no particular order.
///
/// We assume that there aren't a lot of overlappingg intervals, so that this vector
/// never grows very large. If that assumption doesn't hold, we could keep this ordered
/// by the end bound, to speed up `search`. But as long as there are only a few elements,
/// a linear search is OK.
elements: Vec<Arc<I>>,
}
/// Abstraction for an interval that can be stored in the tree
///
/// The start bound is inclusive and the end bound is exclusive. End must be greater
/// than start.
pub trait IntervalItem {
type Key: Ord + Copy + Debug + Sized;
fn start_key(&self) -> Self::Key;
fn end_key(&self) -> Self::Key;
fn bounds(&self) -> Range<Self::Key> {
self.start_key()..self.end_key()
}
}
impl<I: ?Sized> IntervalTree<I>
where
I: IntervalItem,
{
/// Return an element that contains 'key', or precedes it.
///
/// If there are multiple candidates, returns the one with the highest 'end' key.
pub fn search(&self, key: I::Key) -> Option<Arc<I>> {
// Find the greatest point that precedes or is equal to the search key. If there is
// none, returns None.
let (_, p) = self.points.range(..=key).next_back()?;
// Find the element with the highest end key at this point
let highest_item = p
.elements
.iter()
.reduce(|a, b| {
// starting with Rust 1.53, could use `std::cmp::min_by_key` here
if a.end_key() > b.end_key() {
a
} else {
b
}
})
.unwrap();
Some(Arc::clone(highest_item))
}
/// Iterate over all items with start bound >= 'key'
pub fn iter_newer(&self, key: I::Key) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(key..),
elem_iter: None,
}
}
/// Iterate over all items
pub fn iter(&self) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(..),
elem_iter: None,
}
}
pub fn insert(&mut self, item: Arc<I>) {
let start_key = item.start_key();
let end_key = item.end_key();
assert!(start_key < end_key);
let bounds = start_key..end_key;
// Find the starting point and walk forward from there
let mut found_start_point = false;
let iter = self.points.range_mut(bounds);
for (point_key, point) in iter {
if *point_key == start_key {
found_start_point = true;
// It is an error to insert the same item to the tree twice.
assert!(
!point.elements.iter().any(|x| Arc::ptr_eq(x, &item)),
"interval is already in the tree"
);
}
point.elements.push(Arc::clone(&item));
}
if !found_start_point {
// Create a new Point for the starting point
// Look at the previous point, and copy over elements that overlap with this
// new point
let mut new_elements: Vec<Arc<I>> = Vec::new();
if let Some((_, prev_point)) = self.points.range(..start_key).next_back() {
let overlapping_prev_elements = prev_point
.elements
.iter()
.filter(|x| x.bounds().contains(&start_key))
.cloned();
new_elements.extend(overlapping_prev_elements);
}
new_elements.push(item);
let new_point = Point {
elements: new_elements,
};
self.points.insert(start_key, new_point);
}
}
pub fn remove(&mut self, item: &Arc<I>) {
// range search points
let start_key = item.start_key();
let end_key = item.end_key();
let bounds = start_key..end_key;
let mut points_to_remove: Vec<I::Key> = Vec::new();
let mut found_start_point = false;
for (point_key, point) in self.points.range_mut(bounds) {
if *point_key == start_key {
found_start_point = true;
}
let len_before = point.elements.len();
point.elements.retain(|other| !Arc::ptr_eq(other, item));
let len_after = point.elements.len();
assert_eq!(len_after + 1, len_before);
if len_after == 0 {
points_to_remove.push(*point_key);
}
}
assert!(found_start_point);
for k in points_to_remove {
self.points.remove(&k).unwrap();
}
}
}
pub struct IntervalIter<'a, I: ?Sized>
where
I: IntervalItem,
{
point_iter: std::collections::btree_map::Range<'a, I::Key, Point<I>>,
elem_iter: Option<(I::Key, std::slice::Iter<'a, Arc<I>>)>,
}
impl<'a, I> Iterator for IntervalIter<'a, I>
where
I: IntervalItem + ?Sized,
{
type Item = Arc<I>;
fn next(&mut self) -> Option<Self::Item> {
// Iterate over all elements in all the points in 'point_iter'. To avoid
// returning the same element twice, we only return each element at its
// starting point.
loop {
// Return next remaining element from the current point
if let Some((point_key, elem_iter)) = &mut self.elem_iter {
for elem in elem_iter {
if elem.start_key() == *point_key {
return Some(Arc::clone(elem));
}
}
}
// No more elements at this point. Move to next point.
if let Some((point_key, point)) = self.point_iter.next() {
self.elem_iter = Some((*point_key, point.elements.iter()));
continue;
} else {
// No more points, all done
return None;
}
}
}
}
impl<I: ?Sized> Default for IntervalTree<I>
where
I: IntervalItem,
{
fn default() -> Self {
IntervalTree {
points: BTreeMap::new(),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::fmt;
#[derive(Debug)]
struct MockItem {
start_key: u32,
end_key: u32,
val: String,
}
impl IntervalItem for MockItem {
type Key = u32;
fn start_key(&self) -> u32 {
self.start_key
}
fn end_key(&self) -> u32 {
self.end_key
}
}
impl MockItem {
fn new(start_key: u32, end_key: u32) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}", start_key, end_key),
}
}
fn new_str(start_key: u32, end_key: u32, val: &str) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}: {}", start_key, end_key, val),
}
}
}
impl fmt::Display for MockItem {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.val)
}
}
#[rustfmt::skip]
fn assert_search(
tree: &IntervalTree<MockItem>,
key: u32,
expected: &[&str],
) -> Option<Arc<MockItem>> {
if let Some(v) = tree.search(key) {
let vstr = v.to_string();
assert!(!expected.is_empty(), "search with {} returned {}, expected None", key, v);
assert!(
expected.contains(&vstr.as_str()),
"search with {} returned {}, expected one of: {:?}",
key, v, expected,
);
Some(v)
} else {
assert!(
expected.is_empty(),
"search with {} returned None, expected one of {:?}",
key, expected
);
None
}
}
fn assert_contents(tree: &IntervalTree<MockItem>, expected: &[&str]) {
let mut contents: Vec<String> = tree.iter().map(|e| e.to_string()).collect();
contents.sort();
assert_eq!(contents, expected);
}
fn dump_tree(tree: &IntervalTree<MockItem>) {
for (point_key, point) in tree.points.iter() {
print!("{}:", point_key);
for e in point.elements.iter() {
print!(" {}", e);
}
println!();
}
}
#[test]
fn test_interval_tree_simple() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Simple, non-overlapping ranges.
tree.insert(Arc::new(MockItem::new(10, 11)));
tree.insert(Arc::new(MockItem::new(11, 12)));
tree.insert(Arc::new(MockItem::new(12, 13)));
tree.insert(Arc::new(MockItem::new(18, 19)));
tree.insert(Arc::new(MockItem::new(17, 18)));
tree.insert(Arc::new(MockItem::new(15, 16)));
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &["10-11"]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["12-13"]);
assert_search(&tree, 13, &["12-13"]);
assert_search(&tree, 14, &["12-13"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 16, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["18-19"]);
assert_search(&tree, 19, &["18-19"]);
assert_search(&tree, 20, &["18-19"]);
// remove a few entries and search around them again
tree.remove(&assert_search(&tree, 10, &["10-11"]).unwrap()); // first entry
tree.remove(&assert_search(&tree, 12, &["12-13"]).unwrap()); // entry in the middle
tree.remove(&assert_search(&tree, 18, &["18-19"]).unwrap()); // last entry
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &[]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["11-12"]);
assert_search(&tree, 14, &["11-12"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["17-18"]);
}
#[test]
fn test_interval_tree_overlap() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Overlapping items
tree.insert(Arc::new(MockItem::new(22, 24)));
tree.insert(Arc::new(MockItem::new(23, 25)));
let x24_26 = Arc::new(MockItem::new(24, 26));
tree.insert(Arc::clone(&x24_26));
let x26_28 = Arc::new(MockItem::new(26, 28));
tree.insert(Arc::clone(&x26_28));
tree.insert(Arc::new(MockItem::new(25, 27)));
assert_search(&tree, 22, &["22-24"]);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25", "24-26"]);
assert_search(&tree, 25, &["24-26", "25-27"]);
assert_search(&tree, 26, &["25-27", "26-28"]);
assert_search(&tree, 27, &["26-28"]);
assert_search(&tree, 28, &["26-28"]);
assert_search(&tree, 29, &["26-28"]);
tree.remove(&x24_26);
tree.remove(&x26_28);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25"]);
assert_search(&tree, 25, &["25-27"]);
assert_search(&tree, 26, &["25-27"]);
assert_search(&tree, 27, &["25-27"]);
assert_search(&tree, 28, &["25-27"]);
assert_search(&tree, 29, &["25-27"]);
}
#[test]
fn test_interval_tree_nested() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Items containing other items
tree.insert(Arc::new(MockItem::new(31, 39)));
tree.insert(Arc::new(MockItem::new(32, 34)));
tree.insert(Arc::new(MockItem::new(33, 35)));
tree.insert(Arc::new(MockItem::new(30, 40)));
assert_search(&tree, 30, &["30-40"]);
assert_search(&tree, 31, &["30-40", "31-39"]);
assert_search(&tree, 32, &["30-40", "32-34", "31-39"]);
assert_search(&tree, 33, &["30-40", "32-34", "33-35", "31-39"]);
assert_search(&tree, 34, &["30-40", "33-35", "31-39"]);
assert_search(&tree, 35, &["30-40", "31-39"]);
assert_search(&tree, 36, &["30-40", "31-39"]);
assert_search(&tree, 37, &["30-40", "31-39"]);
assert_search(&tree, 38, &["30-40", "31-39"]);
assert_search(&tree, 39, &["30-40"]);
assert_search(&tree, 40, &["30-40"]);
assert_search(&tree, 41, &["30-40"]);
}
#[test]
fn test_interval_tree_duplicates() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Duplicate keys
let item_a = Arc::new(MockItem::new_str(55, 56, "a"));
tree.insert(Arc::clone(&item_a));
let item_b = Arc::new(MockItem::new_str(55, 56, "b"));
tree.insert(Arc::clone(&item_b));
let item_c = Arc::new(MockItem::new_str(55, 56, "c"));
tree.insert(Arc::clone(&item_c));
let item_d = Arc::new(MockItem::new_str(54, 56, "d"));
tree.insert(Arc::clone(&item_d));
let item_e = Arc::new(MockItem::new_str(55, 57, "e"));
tree.insert(Arc::clone(&item_e));
dump_tree(&tree);
assert_search(
&tree,
55,
&["55-56: a", "55-56: b", "55-56: c", "54-56: d", "55-57: e"],
);
tree.remove(&item_b);
dump_tree(&tree);
assert_contents(&tree, &["54-56: d", "55-56: a", "55-56: c", "55-57: e"]);
tree.remove(&item_d);
dump_tree(&tree);
assert_contents(&tree, &["55-56: a", "55-56: c", "55-57: e"]);
}
#[test]
#[should_panic]
fn test_interval_tree_insert_twice() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Inserting the same item twice is not cool
let item = Arc::new(MockItem::new(1, 2));
tree.insert(Arc::clone(&item));
tree.insert(Arc::clone(&item)); // fails assertion
}
}

View File

@@ -9,17 +9,20 @@
//! new image and delta layers and corresponding files are written to disk.
//!
use crate::layered_repository::interval_tree::{IntervalItem, IntervalIter, IntervalTree};
use crate::layered_repository::storage_layer::{Layer, SegmentTag};
use crate::layered_repository::InMemoryLayer;
use crate::relish::*;
use anyhow::Result;
use lazy_static::lazy_static;
use std::cmp::Ordering;
use std::collections::{BTreeMap, BinaryHeap, HashMap};
use std::collections::{BinaryHeap, HashMap};
use std::sync::Arc;
use zenith_metrics::{register_int_gauge, IntGauge};
use zenith_utils::lsn::Lsn;
use super::global_layer_map::{LayerId, GLOBAL_LAYER_MAP};
lazy_static! {
static ref NUM_INMEMORY_LAYERS: IntGauge =
register_int_gauge!("pageserver_inmemory_layers", "Number of layers in memory")
@@ -37,7 +40,7 @@ pub struct LayerMap {
/// All the layers keyed by segment tag
segs: HashMap<SegmentTag, SegEntry>,
/// All in-memory layers, ordered by 'oldest_pending_lsn' and generation
/// All in-memory layers, ordered by 'oldest_lsn' and generation
/// of each layer. This allows easy access to the in-memory layer that
/// contains the oldest WAL record.
open_layers: BinaryHeap<OpenLayerEntry>,
@@ -67,7 +70,9 @@ impl LayerMap {
pub fn get_open(&self, tag: &SegmentTag) -> Option<Arc<InMemoryLayer>> {
let segentry = self.segs.get(tag)?;
segentry.open.as_ref().map(Arc::clone)
segentry
.open_layer_id
.and_then(|layer_id| GLOBAL_LAYER_MAP.read().unwrap().get(&layer_id))
}
///
@@ -76,12 +81,19 @@ impl LayerMap {
pub fn insert_open(&mut self, layer: Arc<InMemoryLayer>) {
let segentry = self.segs.entry(layer.get_seg_tag()).or_default();
segentry.insert_open(Arc::clone(&layer));
let layer_id = segentry.update_open(Arc::clone(&layer));
let oldest_lsn = layer.get_oldest_lsn();
// After a crash and restart, 'oldest_lsn' of the oldest in-memory
// layer becomes the WAL streaming starting point, so it better not point
// in the middle of a WAL record.
assert!(oldest_lsn.is_aligned());
// Also add it to the binary heap
let open_layer_entry = OpenLayerEntry {
oldest_pending_lsn: layer.get_oldest_pending_lsn(),
layer,
oldest_lsn: layer.get_oldest_lsn(),
layer_id,
generation: self.current_generation,
};
self.open_layers.push(open_layer_entry);
@@ -89,21 +101,35 @@ impl LayerMap {
NUM_INMEMORY_LAYERS.inc();
}
/// Remove the oldest in-memory layer
pub fn pop_oldest_open(&mut self) {
// Pop it from the binary heap
let oldest_entry = self.open_layers.pop().unwrap();
let segtag = oldest_entry.layer.get_seg_tag();
/// Remove an open in-memory layer
pub fn remove_open(&mut self, layer_id: LayerId) {
// Note: we don't try to remove the entry from the binary heap.
// It will be removed lazily by peek_oldest_open() when it's made it to
// the top of the heap.
// Also remove it from the SegEntry of this segment
let mut segentry = self.segs.get_mut(&segtag).unwrap();
assert!(Arc::ptr_eq(
segentry.open.as_ref().unwrap(),
&oldest_entry.layer
));
segentry.open = None;
let layer_opt = {
let mut global_map = GLOBAL_LAYER_MAP.write().unwrap();
let layer_opt = global_map.get(&layer_id);
global_map.remove(&layer_id);
// TODO it's bad that a ref can still exist after being evicted from cache
layer_opt
};
NUM_INMEMORY_LAYERS.dec();
if let Some(layer) = layer_opt {
let mut segentry = self.segs.get_mut(&layer.get_seg_tag()).unwrap();
if segentry.open_layer_id == Some(layer_id) {
// Also remove it from the SegEntry of this segment
segentry.open_layer_id = None;
} else {
// We could have already updated segentry.open for
// dropped (non-writeable) layer. This is fine.
assert!(!layer.is_writeable());
assert!(layer.is_dropped());
}
NUM_INMEMORY_LAYERS.dec();
}
}
///
@@ -121,12 +147,11 @@ impl LayerMap {
///
/// This should be called when the corresponding file on disk has been deleted.
///
pub fn remove_historic(&mut self, layer: &dyn Layer) {
pub fn remove_historic(&mut self, layer: Arc<dyn Layer>) {
let tag = layer.get_seg_tag();
let start_lsn = layer.get_start_lsn();
if let Some(segentry) = self.segs.get_mut(&tag) {
segentry.historic.remove(&start_lsn);
segentry.historic.remove(&layer);
}
NUM_ONDISK_LAYERS.dec();
}
@@ -144,7 +169,7 @@ impl LayerMap {
if (request_rel.spcnode == 0 || reltag.spcnode == request_rel.spcnode)
&& (request_rel.dbnode == 0 || reltag.dbnode == request_rel.dbnode)
{
if let Some(exists) = segentry.exists_at_lsn(lsn) {
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
@@ -152,7 +177,7 @@ impl LayerMap {
}
_ => {
if tag == None {
if let Some(exists) = segentry.exists_at_lsn(lsn) {
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
@@ -166,19 +191,46 @@ impl LayerMap {
///
/// This is used for garbage collection, to determine if an old layer can
/// be deleted.
pub fn newer_image_layer_exists(&self, seg: SegmentTag, lsn: Lsn) -> bool {
/// We ignore segments newer than disk_consistent_lsn because they will be removed at restart
pub fn newer_image_layer_exists(
&self,
seg: SegmentTag,
lsn: Lsn,
disk_consistent_lsn: Lsn,
) -> bool {
if let Some(segentry) = self.segs.get(&seg) {
segentry.newer_image_layer_exists(lsn)
segentry.newer_image_layer_exists(lsn, disk_consistent_lsn)
} else {
false
}
}
/// Is there any layer for given segment that is alive at the lsn?
///
/// This is a public wrapper for SegEntry fucntion,
/// used for garbage collection, to determine if some alive layer
/// exists at the lsn. If so, we shouldn't delete a newer dropped layer
/// to avoid incorrectly making it visible.
pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result<bool> {
Ok(if let Some(segentry) = self.segs.get(&seg) {
segentry.exists_at_lsn(lsn)?.unwrap_or(false)
} else {
false
})
}
/// Return the oldest in-memory layer, along with its generation number.
pub fn peek_oldest_open(&self) -> Option<(Arc<InMemoryLayer>, u64)> {
self.open_layers
.peek()
.map(|oldest_entry| (Arc::clone(&oldest_entry.layer), oldest_entry.generation))
pub fn peek_oldest_open(&mut self) -> Option<(LayerId, Arc<InMemoryLayer>, u64)> {
let global_map = GLOBAL_LAYER_MAP.read().unwrap();
while let Some(oldest_entry) = self.open_layers.peek() {
if let Some(layer) = global_map.get(&oldest_entry.layer_id) {
return Some((oldest_entry.layer_id, layer, oldest_entry.generation));
} else {
self.open_layers.pop();
}
}
None
}
/// Increment the generation number used to stamp open in-memory layers. Layers
@@ -191,7 +243,7 @@ impl LayerMap {
pub fn iter_historic_layers(&self) -> HistoricLayerIter {
HistoricLayerIter {
segiter: self.segs.iter(),
seg_iter: self.segs.iter(),
iter: None,
}
}
@@ -201,11 +253,15 @@ impl LayerMap {
pub fn dump(&self) -> Result<()> {
println!("Begin dump LayerMap");
for (seg, segentry) in self.segs.iter() {
if let Some(open) = &segentry.open {
open.dump()?;
if let Some(open) = &segentry.open_layer_id {
if let Some(layer) = GLOBAL_LAYER_MAP.read().unwrap().get(open) {
layer.dump()?;
} else {
println!("layer not found in global map");
}
}
for (_, layer) in segentry.historic.iter() {
for layer in segentry.historic.iter() {
layer.dump()?;
}
}
@@ -214,99 +270,105 @@ impl LayerMap {
}
}
impl IntervalItem for dyn Layer {
type Key = Lsn;
fn start_key(&self) -> Lsn {
self.get_start_lsn()
}
fn end_key(&self) -> Lsn {
self.get_end_lsn()
}
}
///
/// Per-segment entry in the LayerMap::segs hash map. Holds all the layers
/// associated with the segment.
///
/// The last layer that is open for writes is always an InMemoryLayer,
/// and is kept in a separate field, because there can be only one for
/// each segment. The older layers, stored on disk, are kept in a
/// BTreeMap keyed by the layer's start LSN.
/// each segment. The older layers, stored on disk, are kept in an
/// IntervalTree.
#[derive(Default)]
struct SegEntry {
pub open: Option<Arc<InMemoryLayer>>,
pub historic: BTreeMap<Lsn, Arc<dyn Layer>>,
open_layer_id: Option<LayerId>,
historic: IntervalTree<dyn Layer>,
}
impl SegEntry {
/// Does the segment exist at given LSN?
/// Return None if object is not found in this SegEntry.
fn exists_at_lsn(&self, lsn: Lsn) -> Option<bool> {
if let Some(layer) = &self.open {
if layer.get_start_lsn() <= lsn && lsn <= layer.get_end_lsn() {
let exists = layer.get_seg_exists(lsn).ok()?;
return Some(exists);
}
} else if let Some((_, layer)) = self.historic.range(..=lsn).next_back() {
let exists = layer.get_seg_exists(lsn).ok()?;
return Some(exists);
fn exists_at_lsn(&self, lsn: Lsn) -> Result<Option<bool>> {
if let Some(layer) = self.get(lsn) {
Ok(Some(layer.get_seg_exists(lsn)?))
} else {
Ok(None)
}
None
}
pub fn get(&self, lsn: Lsn) -> Option<Arc<dyn Layer>> {
if let Some(open) = &self.open {
if open.get_start_lsn() <= lsn {
let x: Arc<dyn Layer> = Arc::clone(open) as _;
return Some(x);
if let Some(open_layer_id) = &self.open_layer_id {
let open_layer = GLOBAL_LAYER_MAP.read().unwrap().get(open_layer_id)?;
if open_layer.get_start_lsn() <= lsn {
return Some(open_layer);
}
}
if let Some((_start_lsn, layer)) = self.historic.range(..=lsn).next_back() {
Some(Arc::clone(layer))
} else {
None
}
self.historic.search(lsn)
}
pub fn newer_image_layer_exists(&self, lsn: Lsn) -> bool {
pub fn newer_image_layer_exists(&self, lsn: Lsn, disk_consistent_lsn: Lsn) -> bool {
// We only check on-disk layers, because
// in-memory layers are not durable
for (_newer_lsn, layer) in self.historic.range(lsn..) {
// Ignore incremental layers.
if layer.is_incremental() {
continue;
}
if layer.get_end_lsn() > lsn {
return true;
} else {
continue;
}
}
false
// The end-LSN is exclusive, while disk_consistent_lsn is
// inclusive. For example, if disk_consistent_lsn is 100, it is
// OK for a delta layer to have end LSN 101, but if the end LSN
// is 102, then it might not have been fully flushed to disk
// before crash.
self.historic
.iter_newer(lsn)
.any(|layer| !layer.is_incremental() && layer.get_end_lsn() <= disk_consistent_lsn + 1)
}
pub fn insert_open(&mut self, layer: Arc<InMemoryLayer>) {
assert!(self.open.is_none());
self.open = Some(layer);
// Set new open layer for a SegEntry.
// It's ok to rewrite previous open layer,
// but only if it is not writeable anymore.
pub fn update_open(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
if let Some(prev_open_layer_id) = &self.open_layer_id {
if let Some(prev_open_layer) = GLOBAL_LAYER_MAP.read().unwrap().get(prev_open_layer_id)
{
assert!(!prev_open_layer.is_writeable());
}
}
let open_layer_id = GLOBAL_LAYER_MAP.write().unwrap().insert(layer);
self.open_layer_id = Some(open_layer_id);
open_layer_id
}
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
let start_lsn = layer.get_start_lsn();
self.historic.insert(start_lsn, layer);
self.historic.insert(layer);
}
}
/// Entry held in LayerMap::open_layers, with boilerplate comparison routines
/// to implement a min-heap ordered by 'oldest_pending_lsn' and 'generation'
/// to implement a min-heap ordered by 'oldest_lsn' and 'generation'
///
/// The generation number associated with each entry can be used to distinguish
/// recently-added entries (i.e after last call to increment_generation()) from older
/// entries with the same 'oldest_pending_lsn'.
/// entries with the same 'oldest_lsn'.
struct OpenLayerEntry {
pub oldest_pending_lsn: Lsn, // copy of layer.get_oldest_pending_lsn()
pub generation: u64,
pub layer: Arc<InMemoryLayer>,
oldest_lsn: Lsn, // copy of layer.get_oldest_lsn()
generation: u64,
layer_id: LayerId,
}
impl Ord for OpenLayerEntry {
fn cmp(&self, other: &Self) -> Ordering {
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that. Entries with identical oldest_pending_lsn are ordered by generation
// to get that. Entries with identical oldest_lsn are ordered by generation
other
.oldest_pending_lsn
.cmp(&self.oldest_pending_lsn)
.oldest_lsn
.cmp(&self.oldest_lsn)
.then_with(|| other.generation.cmp(&self.generation))
}
}
@@ -324,8 +386,8 @@ impl Eq for OpenLayerEntry {}
/// Iterator returned by LayerMap::iter_historic_layers()
pub struct HistoricLayerIter<'a> {
segiter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<std::collections::btree_map::Iter<'a, Lsn, Arc<dyn Layer>>>,
seg_iter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<IntervalIter<'a, dyn Layer>>,
}
impl<'a> Iterator for HistoricLayerIter<'a> {
@@ -335,11 +397,11 @@ impl<'a> Iterator for HistoricLayerIter<'a> {
loop {
if let Some(x) = &mut self.iter {
if let Some(x) = x.next() {
return Some(Arc::clone(&*x.1));
return Some(Arc::clone(&x));
}
}
if let Some(seg) = self.segiter.next() {
self.iter = Some(seg.1.historic.iter());
if let Some((_tag, segentry)) = self.seg_iter.next() {
self.iter = Some(segentry.historic.iter());
continue;
} else {
return None;
@@ -351,7 +413,7 @@ impl<'a> Iterator for HistoricLayerIter<'a> {
#[cfg(test)]
mod tests {
use super::*;
use crate::PageServerConf;
use crate::config::PageServerConf;
use std::str::FromStr;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
@@ -363,24 +425,31 @@ mod tests {
forknum: 0,
});
lazy_static! {
static ref DUMMY_TIMELINEID: ZTimelineId =
ZTimelineId::from_str("00000000000000000000000000000000").unwrap();
static ref DUMMY_TENANTID: ZTenantId =
ZTenantId::from_str("00000000000000000000000000000000").unwrap();
}
/// Construct a dummy InMemoryLayer for testing
fn dummy_inmem_layer(
conf: &'static PageServerConf,
segno: u32,
start_lsn: Lsn,
oldest_pending_lsn: Lsn,
oldest_lsn: Lsn,
) -> Arc<InMemoryLayer> {
Arc::new(
InMemoryLayer::create(
conf,
ZTimelineId::from_str("00000000000000000000000000000000").unwrap(),
ZTenantId::from_str("00000000000000000000000000000000").unwrap(),
*DUMMY_TIMELINEID,
*DUMMY_TENANTID,
SegmentTag {
rel: TESTREL_A,
segno,
},
start_lsn,
oldest_pending_lsn,
oldest_lsn,
)
.unwrap(),
)
@@ -390,34 +459,35 @@ mod tests {
fn test_open_layers() -> Result<()> {
let conf = PageServerConf::dummy_conf(PageServerConf::test_repo_dir("dummy_inmem_layer"));
let conf = Box::leak(Box::new(conf));
std::fs::create_dir_all(conf.timeline_path(&DUMMY_TIMELINEID, &DUMMY_TENANTID))?;
let mut layers = LayerMap::default();
let gen1 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 0, Lsn(100), Lsn(100)));
layers.insert_open(dummy_inmem_layer(conf, 1, Lsn(100), Lsn(200)));
layers.insert_open(dummy_inmem_layer(conf, 2, Lsn(100), Lsn(120)));
layers.insert_open(dummy_inmem_layer(conf, 3, Lsn(100), Lsn(110)));
layers.insert_open(dummy_inmem_layer(conf, 0, Lsn(0x100), Lsn(0x100)));
layers.insert_open(dummy_inmem_layer(conf, 1, Lsn(0x100), Lsn(0x200)));
layers.insert_open(dummy_inmem_layer(conf, 2, Lsn(0x100), Lsn(0x120)));
layers.insert_open(dummy_inmem_layer(conf, 3, Lsn(0x100), Lsn(0x110)));
let gen2 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 4, Lsn(100), Lsn(110)));
layers.insert_open(dummy_inmem_layer(conf, 5, Lsn(100), Lsn(100)));
layers.insert_open(dummy_inmem_layer(conf, 4, Lsn(0x100), Lsn(0x110)));
layers.insert_open(dummy_inmem_layer(conf, 5, Lsn(0x100), Lsn(0x100)));
// A helper function (closure) to pop the next oldest open entry from the layer map,
// and assert that it is what we'd expect
let mut assert_pop_layer = |expected_segno: u32, expected_generation: u64| {
let (l, generation) = layers.peek_oldest_open().unwrap();
let (layer_id, l, generation) = layers.peek_oldest_open().unwrap();
assert!(l.get_seg_tag().segno == expected_segno);
assert!(generation == expected_generation);
layers.pop_oldest_open();
layers.remove_open(layer_id);
};
assert_pop_layer(0, gen1); // 100
assert_pop_layer(5, gen2); // 100
assert_pop_layer(3, gen1); // 110
assert_pop_layer(4, gen2); // 110
assert_pop_layer(2, gen1); // 120
assert_pop_layer(1, gen1); // 200
assert_pop_layer(0, gen1); // 0x100
assert_pop_layer(5, gen2); // 0x100
assert_pop_layer(3, gen1); // 0x110
assert_pop_layer(4, gen2); // 0x110
assert_pop_layer(2, gen1); // 0x120
assert_pop_layer(1, gen1); // 0x200
Ok(())
}

View File

@@ -0,0 +1,228 @@
//! Every image of a certain timeline from [`crate::layered_repository::LayeredRepository`]
//! has a metadata that needs to be stored persistently.
//!
//! Later, the file gets is used in [`crate::remote_storage::storage_sync`] as a part of
//! external storage import and export operations.
//!
//! The module contains all structs and related helper methods related to timeline metadata.
use std::{convert::TryInto, path::PathBuf};
use anyhow::ensure;
use zenith_utils::{
bin_ser::BeSer,
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
use crate::config::PageServerConf;
// Taken from PG_CONTROL_MAX_SAFE_SIZE
const METADATA_MAX_SAFE_SIZE: usize = 512;
const METADATA_CHECKSUM_SIZE: usize = std::mem::size_of::<u32>();
const METADATA_MAX_DATA_SIZE: usize = METADATA_MAX_SAFE_SIZE - METADATA_CHECKSUM_SIZE;
/// The name of the metadata file pageserver creates per timeline.
pub const METADATA_FILE_NAME: &str = "metadata";
/// Metadata stored on disk for each timeline
///
/// The fields correspond to the values we hold in memory, in LayeredTimeline.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct TimelineMetadata {
disk_consistent_lsn: Lsn,
// This is only set if we know it. We track it in memory when the page
// server is running, but we only track the value corresponding to
// 'last_record_lsn', not 'disk_consistent_lsn' which can lag behind by a
// lot. We only store it in the metadata file when we flush *all* the
// in-memory data so that 'last_record_lsn' is the same as
// 'disk_consistent_lsn'. That's OK, because after page server restart, as
// soon as we reprocess at least one record, we will have a valid
// 'prev_record_lsn' value in memory again. This is only really needed when
// doing a clean shutdown, so that there is no more WAL beyond
// 'disk_consistent_lsn'
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
}
/// Points to a place in pageserver's local directory,
/// where certain timeline's metadata file should be located.
pub fn metadata_path(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> PathBuf {
conf.timeline_path(&timelineid, &tenantid)
.join(METADATA_FILE_NAME)
}
impl TimelineMetadata {
pub fn new(
disk_consistent_lsn: Lsn,
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
) -> Self {
Self {
disk_consistent_lsn,
prev_record_lsn,
ancestor_timeline,
ancestor_lsn,
latest_gc_cutoff_lsn,
initdb_lsn,
}
}
pub fn from_bytes(metadata_bytes: &[u8]) -> anyhow::Result<Self> {
ensure!(
metadata_bytes.len() == METADATA_MAX_SAFE_SIZE,
"metadata bytes size is wrong"
);
let data = &metadata_bytes[..METADATA_MAX_DATA_SIZE];
let calculated_checksum = crc32c::crc32c(data);
let checksum_bytes: &[u8; METADATA_CHECKSUM_SIZE] =
metadata_bytes[METADATA_MAX_DATA_SIZE..].try_into()?;
let expected_checksum = u32::from_le_bytes(*checksum_bytes);
ensure!(
calculated_checksum == expected_checksum,
"metadata checksum mismatch"
);
let data = TimelineMetadata::from(serialize::DeTimelineMetadata::des_prefix(data)?);
assert!(data.disk_consistent_lsn.is_aligned());
Ok(data)
}
pub fn to_bytes(&self) -> anyhow::Result<Vec<u8>> {
let serializeable_metadata = serialize::SeTimelineMetadata::from(self);
let mut metadata_bytes = serialize::SeTimelineMetadata::ser(&serializeable_metadata)?;
assert!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE);
metadata_bytes.resize(METADATA_MAX_SAFE_SIZE, 0u8);
let checksum = crc32c::crc32c(&metadata_bytes[..METADATA_MAX_DATA_SIZE]);
metadata_bytes[METADATA_MAX_DATA_SIZE..].copy_from_slice(&u32::to_le_bytes(checksum));
Ok(metadata_bytes)
}
/// [`Lsn`] that corresponds to the corresponding timeline directory
/// contents, stored locally in the pageserver workdir.
pub fn disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
}
pub fn prev_record_lsn(&self) -> Option<Lsn> {
self.prev_record_lsn
}
pub fn ancestor_timeline(&self) -> Option<ZTimelineId> {
self.ancestor_timeline
}
pub fn ancestor_lsn(&self) -> Lsn {
self.ancestor_lsn
}
pub fn latest_gc_cutoff_lsn(&self) -> Lsn {
self.latest_gc_cutoff_lsn
}
pub fn initdb_lsn(&self) -> Lsn {
self.initdb_lsn
}
}
/// This module is for direct conversion of metadata to bytes and back.
/// For a certain metadata, besides the conversion a few verification steps has to
/// be done, so all serde derives are hidden from the user, to avoid accidental
/// verification-less metadata creation.
mod serialize {
use serde::{Deserialize, Serialize};
use zenith_utils::{lsn::Lsn, zid::ZTimelineId};
use super::TimelineMetadata;
#[derive(Serialize)]
pub(super) struct SeTimelineMetadata<'a> {
disk_consistent_lsn: &'a Lsn,
prev_record_lsn: &'a Option<Lsn>,
ancestor_timeline: &'a Option<ZTimelineId>,
ancestor_lsn: &'a Lsn,
latest_gc_cutoff_lsn: &'a Lsn,
initdb_lsn: &'a Lsn,
}
impl<'a> From<&'a TimelineMetadata> for SeTimelineMetadata<'a> {
fn from(other: &'a TimelineMetadata) -> Self {
Self {
disk_consistent_lsn: &other.disk_consistent_lsn,
prev_record_lsn: &other.prev_record_lsn,
ancestor_timeline: &other.ancestor_timeline,
ancestor_lsn: &other.ancestor_lsn,
latest_gc_cutoff_lsn: &other.latest_gc_cutoff_lsn,
initdb_lsn: &other.initdb_lsn,
}
}
}
#[derive(Deserialize)]
pub(super) struct DeTimelineMetadata {
disk_consistent_lsn: Lsn,
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
}
impl From<DeTimelineMetadata> for TimelineMetadata {
fn from(other: DeTimelineMetadata) -> Self {
Self {
disk_consistent_lsn: other.disk_consistent_lsn,
prev_record_lsn: other.prev_record_lsn,
ancestor_timeline: other.ancestor_timeline,
ancestor_lsn: other.ancestor_lsn,
latest_gc_cutoff_lsn: other.latest_gc_cutoff_lsn,
initdb_lsn: other.initdb_lsn,
}
}
}
}
#[cfg(test)]
mod tests {
use crate::repository::repo_harness::TIMELINE_ID;
use super::*;
#[test]
fn metadata_serializes_correctly() {
let original_metadata = TimelineMetadata {
disk_consistent_lsn: Lsn(0x200),
prev_record_lsn: Some(Lsn(0x100)),
ancestor_timeline: Some(TIMELINE_ID),
ancestor_lsn: Lsn(0),
latest_gc_cutoff_lsn: Lsn(0),
initdb_lsn: Lsn(0),
};
let metadata_bytes = original_metadata
.to_bytes()
.expect("Should serialize correct metadata to bytes");
let deserialized_metadata = TimelineMetadata::from_bytes(&metadata_bytes)
.expect("Should deserialize its own bytes");
assert_eq!(
deserialized_metadata, original_metadata,
"Metadata that was serialized to bytes and deserialized back should not change"
);
}
}

View File

@@ -0,0 +1,268 @@
//!
//! Data structure to ingest incoming WAL into an append-only file.
//!
//! - The file is considered temporary, and will be discarded on crash
//! - based on a B-tree
//!
use std::os::unix::fs::FileExt;
use std::{collections::HashMap, ops::RangeBounds, slice};
use anyhow::Result;
use std::cmp::min;
use std::io::Seek;
use zenith_utils::{lsn::Lsn, vec_map::VecMap};
use super::storage_layer::PageVersion;
use crate::layered_repository::ephemeral_file::EphemeralFile;
use zenith_utils::bin_ser::BeSer;
const EMPTY_SLICE: &[(Lsn, u64)] = &[];
pub struct PageVersions {
map: HashMap<u32, VecMap<Lsn, u64>>,
/// The PageVersion structs are stored in a serialized format in this file.
/// Each serialized PageVersion is preceded by a 'u32' length field.
/// The 'map' stores offsets into this file.
file: EphemeralFile,
}
impl PageVersions {
pub fn new(file: EphemeralFile) -> PageVersions {
PageVersions {
map: HashMap::new(),
file,
}
}
pub fn append_or_update_last(
&mut self,
blknum: u32,
lsn: Lsn,
page_version: PageVersion,
) -> Result<Option<u64>> {
// remember starting position
let pos = self.file.stream_position()?;
// make room for the 'length' field by writing zeros as a placeholder.
self.file.seek(std::io::SeekFrom::Start(pos + 4)).unwrap();
page_version.ser_into(&mut self.file).unwrap();
// write the 'length' field.
let len = self.file.stream_position()? - pos - 4;
let lenbuf = u32::to_ne_bytes(len as u32);
self.file.write_all_at(&lenbuf, pos)?;
let map = self.map.entry(blknum).or_insert_with(VecMap::default);
Ok(map.append_or_update_last(lsn, pos as u64).unwrap().0)
}
/// Get all [`PageVersion`]s in a block
fn get_block_slice(&self, blknum: u32) -> &[(Lsn, u64)] {
self.map
.get(&blknum)
.map(VecMap::as_slice)
.unwrap_or(EMPTY_SLICE)
}
/// Get a range of [`PageVersions`] in a block
pub fn get_block_lsn_range<R: RangeBounds<Lsn>>(&self, blknum: u32, range: R) -> &[(Lsn, u64)] {
self.map
.get(&blknum)
.map(|vec_map| vec_map.slice_range(range))
.unwrap_or(EMPTY_SLICE)
}
/// Iterate through [`PageVersion`]s in (block, lsn) order.
/// If a [`cutoff_lsn`] is set, only show versions with `lsn < cutoff_lsn`
pub fn ordered_page_version_iter(&self, cutoff_lsn: Option<Lsn>) -> OrderedPageVersionIter<'_> {
let mut ordered_blocks: Vec<u32> = self.map.keys().cloned().collect();
ordered_blocks.sort_unstable();
let slice = ordered_blocks
.first()
.map(|&blknum| self.get_block_slice(blknum))
.unwrap_or(EMPTY_SLICE);
OrderedPageVersionIter {
page_versions: self,
ordered_blocks,
cur_block_idx: 0,
cutoff_lsn,
cur_slice_iter: slice.iter(),
}
}
///
/// Read a page version.
///
pub fn read_pv(&self, off: u64) -> Result<PageVersion> {
let mut buf = Vec::new();
self.read_pv_bytes(off, &mut buf)?;
Ok(PageVersion::des(&buf)?)
}
///
/// Read a page version, as raw bytes, at the given offset. The bytes
/// are read into 'buf', which is expanded if necessary. Returns the
/// size of the page version.
///
pub fn read_pv_bytes(&self, off: u64, buf: &mut Vec<u8>) -> Result<usize> {
// read length
let mut lenbuf = [0u8; 4];
self.file.read_exact_at(&mut lenbuf, off)?;
let len = u32::from_ne_bytes(lenbuf) as usize;
// Resize the buffer to fit the data, if needed.
//
// We don't shrink the buffer if it's larger than necessary. That avoids
// repeatedly shrinking and expanding when you reuse the same buffer to
// read multiple page versions. Expanding a Vec requires initializing the
// new bytes, which is a waste of time because we're immediately overwriting
// it, but there's no way to avoid it without resorting to unsafe code.
if buf.len() < len {
buf.resize(len, 0);
}
self.file.read_exact_at(&mut buf[0..len], off + 4)?;
Ok(len)
}
}
pub struct PageVersionReader<'a> {
file: &'a EphemeralFile,
pos: u64,
end_pos: u64,
}
impl<'a> std::io::Read for PageVersionReader<'a> {
fn read(&mut self, buf: &mut [u8]) -> Result<usize, std::io::Error> {
let len = min(buf.len(), (self.end_pos - self.pos) as usize);
let n = self.file.read_at(&mut buf[..len], self.pos)?;
self.pos += n as u64;
Ok(n)
}
}
pub struct OrderedPageVersionIter<'a> {
page_versions: &'a PageVersions,
ordered_blocks: Vec<u32>,
cur_block_idx: usize,
cutoff_lsn: Option<Lsn>,
cur_slice_iter: slice::Iter<'a, (Lsn, u64)>,
}
impl OrderedPageVersionIter<'_> {
fn is_lsn_before_cutoff(&self, lsn: &Lsn) -> bool {
if let Some(cutoff_lsn) = self.cutoff_lsn.as_ref() {
lsn < cutoff_lsn
} else {
true
}
}
}
impl<'a> Iterator for OrderedPageVersionIter<'a> {
type Item = (u32, Lsn, u64);
fn next(&mut self) -> Option<Self::Item> {
loop {
if let Some((lsn, pos)) = self.cur_slice_iter.next() {
if self.is_lsn_before_cutoff(lsn) {
let blknum = self.ordered_blocks[self.cur_block_idx];
return Some((blknum, *lsn, *pos));
}
}
let next_block_idx = self.cur_block_idx + 1;
let blknum: u32 = *self.ordered_blocks.get(next_block_idx)?;
self.cur_block_idx = next_block_idx;
self.cur_slice_iter = self.page_versions.get_block_slice(blknum).iter();
}
}
}
#[cfg(test)]
mod tests {
use bytes::Bytes;
use super::*;
use crate::config::PageServerConf;
use std::fs;
use std::str::FromStr;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
fn repo_harness(test_name: &str) -> Result<(&'static PageServerConf, ZTenantId, ZTimelineId)> {
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
let conf = PageServerConf::dummy_conf(repo_dir);
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenantid = ZTenantId::from_str("11000000000000000000000000000000").unwrap();
let timelineid = ZTimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&timelineid, &tenantid))?;
Ok((conf, tenantid, timelineid))
}
#[test]
fn test_ordered_iter() -> Result<()> {
let (conf, tenantid, timelineid) = repo_harness("test_ordered_iter")?;
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
let mut page_versions = PageVersions::new(file);
const BLOCKS: u32 = 1000;
const LSNS: u64 = 50;
let empty_page = Bytes::from_static(&[0u8; 8192]);
let empty_page_version = PageVersion::Page(empty_page);
for blknum in 0..BLOCKS {
for lsn in 0..LSNS {
let old = page_versions.append_or_update_last(
blknum,
Lsn(lsn),
empty_page_version.clone(),
)?;
assert!(old.is_none());
}
}
let mut iter = page_versions.ordered_page_version_iter(None);
for blknum in 0..BLOCKS {
for lsn in 0..LSNS {
let (actual_blknum, actual_lsn, _pv) = iter.next().unwrap();
assert_eq!(actual_blknum, blknum);
assert_eq!(Lsn(lsn), actual_lsn);
}
}
assert!(iter.next().is_none());
assert!(iter.next().is_none()); // should be robust against excessive next() calls
const CUTOFF_LSN: Lsn = Lsn(30);
let mut iter = page_versions.ordered_page_version_iter(Some(CUTOFF_LSN));
for blknum in 0..BLOCKS {
for lsn in 0..CUTOFF_LSN.0 {
let (actual_blknum, actual_lsn, _pv) = iter.next().unwrap();
assert_eq!(actual_blknum, blknum);
assert_eq!(Lsn(lsn), actual_lsn);
}
}
assert!(iter.next().is_none());
assert!(iter.next().is_none()); // should be robust against excessive next() calls
Ok(())
}
}

View File

@@ -0,0 +1,55 @@
use std::{
io,
path::{Path, PathBuf},
sync::atomic::{AtomicUsize, Ordering},
};
use crate::virtual_file::VirtualFile;
fn fsync_path(path: &Path) -> io::Result<()> {
let file = VirtualFile::open(path)?;
file.sync_all()
}
fn parallel_worker(paths: &[PathBuf], next_path_idx: &AtomicUsize) -> io::Result<()> {
while let Some(path) = paths.get(next_path_idx.fetch_add(1, Ordering::Relaxed)) {
fsync_path(path)?;
}
Ok(())
}
pub fn par_fsync(paths: &[PathBuf]) -> io::Result<()> {
const PARALLEL_PATH_THRESHOLD: usize = 1;
if paths.len() <= PARALLEL_PATH_THRESHOLD {
for path in paths {
fsync_path(path)?;
}
return Ok(());
}
/// Use at most this number of threads.
/// Increasing this limit will
/// - use more memory
/// - increase the cost of spawn/join latency
const MAX_NUM_THREADS: usize = 64;
let num_threads = paths.len().min(MAX_NUM_THREADS);
let next_path_idx = AtomicUsize::new(0);
crossbeam_utils::thread::scope(|s| -> io::Result<()> {
let mut handles = vec![];
// Spawn `num_threads - 1`, as the current thread is also a worker.
for _ in 1..num_threads {
handles.push(s.spawn(|_| parallel_worker(paths, &next_path_idx)));
}
parallel_worker(paths, &next_path_idx)?;
for handle in handles {
handle.join().unwrap()?;
}
Ok(())
})
.unwrap()
}

View File

@@ -3,14 +3,13 @@
//!
use crate::relish::RelishTag;
use crate::repository::WALRecord;
use crate::ZTimelineId;
use crate::repository::{BlockNumber, ZenithWalRecord};
use crate::{ZTenantId, ZTimelineId};
use anyhow::Result;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use std::fmt;
use std::path::PathBuf;
use std::sync::Arc;
use zenith_utils::lsn::Lsn;
@@ -27,6 +26,18 @@ pub struct SegmentTag {
pub segno: u32,
}
/// SegmentBlk represents a block number within a segment, or the size of segment.
///
/// This is separate from BlockNumber, which is used for block number within the
/// whole relish. Since this is just a type alias, the compiler will let you mix
/// them freely, but we use the type alias as documentation to make it clear
/// which one we're dealing with.
///
/// (We could turn this into "struct SegmentBlk(u32)" to forbid accidentally
/// assigning a BlockNumber to SegmentBlk or vice versa, but that makes
/// operations more verbose).
pub type SegmentBlk = u32;
impl fmt::Display for SegmentTag {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}.{}", self.rel, self.segno)
@@ -34,15 +45,16 @@ impl fmt::Display for SegmentTag {
}
impl SegmentTag {
pub const fn from_blknum(rel: RelishTag, blknum: u32) -> SegmentTag {
SegmentTag {
rel,
segno: blknum / RELISH_SEG_SIZE,
}
}
pub fn blknum_in_seg(&self, blknum: u32) -> bool {
blknum / RELISH_SEG_SIZE == self.segno
/// Given a relish and block number, calculate the corresponding segment and
/// block number within the segment.
pub const fn from_blknum(rel: RelishTag, blknum: BlockNumber) -> (SegmentTag, SegmentBlk) {
(
SegmentTag {
rel,
segno: blknum / RELISH_SEG_SIZE,
},
blknum % RELISH_SEG_SIZE,
)
}
}
@@ -52,23 +64,10 @@ impl SegmentTag {
///
/// A page version can be stored as a full page image, or as WAL record that needs
/// to be applied over the previous page version to reconstruct this version.
///
/// It's also possible to have both a WAL record and a page image in the same
/// PageVersion. That happens if page version is originally stored as a WAL record
/// but it is later reconstructed by a GetPage@LSN request by performing WAL
/// redo. The get_page_at_lsn() code will store the reconstructed pag image next to
/// the WAL record in that case. TODO: That's pretty accidental, not the result
/// of any grand design. If we want to keep reconstructed page versions around, we
/// probably should have a separate buffer cache so that we could control the
/// replacement policy globally. Or if we keep a reconstructed page image, we
/// could throw away the WAL record.
///
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PageVersion {
/// an 8kb page image
pub page_image: Option<Bytes>,
/// WAL record to get from previous page version to this one.
pub record: Option<WALRecord>,
pub enum PageVersion {
Page(Bytes),
Wal(ZenithWalRecord),
}
///
@@ -79,7 +78,7 @@ pub struct PageVersion {
/// 'records' contains the records to apply over the base image.
///
pub struct PageReconstructData {
pub records: Vec<WALRecord>,
pub records: Vec<(Lsn, ZenithWalRecord)>,
pub page_img: Option<Bytes>,
}
@@ -87,13 +86,15 @@ pub struct PageReconstructData {
pub enum PageReconstructResult {
/// Got all the data needed to reconstruct the requested page
Complete,
/// This layer didn't contain all the required data, the caller should collect
/// more data from the returned predecessor layer at the returned LSN.
Continue(Lsn, Arc<dyn Layer>),
/// This layer didn't contain all the required data, the caller should look up
/// the predecessor layer at the returned LSN and collect more data from there.
Continue(Lsn),
/// This layer didn't contain data needed to reconstruct the page version at
/// the returned LSN. This is usually considered an error, but might be OK
/// in some circumstances.
Missing(Lsn),
/// Use the cached image at `cached_img_lsn` as the base image
Cached,
}
///
@@ -105,30 +106,27 @@ pub enum PageReconstructResult {
/// in-memory and on-disk layers.
///
pub trait Layer: Send + Sync {
fn get_tenant_id(&self) -> ZTenantId;
/// Identify the timeline this relish belongs to
fn get_timeline_id(&self) -> ZTimelineId;
/// Identify the relish segment
fn get_seg_tag(&self) -> SegmentTag;
/// Inclusive start bound of the LSN range that this layer hold
/// Inclusive start bound of the LSN range that this layer holds
fn get_start_lsn(&self) -> Lsn;
/// 'end_lsn' meaning depends on the layer kind:
/// - in-memory layer is either unbounded (end_lsn = MAX_LSN) or dropped (end_lsn = drop_lsn)
/// - image layer represents snapshot at one LSN, so end_lsn = lsn
/// - delta layer has end_lsn
/// Exclusive end bound of the LSN range that this layer holds.
///
/// TODO Is end_lsn always exclusive for all layer kinds?
/// - For an open in-memory layer, this is MAX_LSN.
/// - For a frozen in-memory layer or a delta layer, this is a valid end bound.
/// - An image layer represents snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
fn get_end_lsn(&self) -> Lsn;
/// Is the segment represented by this layer dropped by PostgreSQL?
fn is_dropped(&self) -> bool;
/// Gets the physical location of the layer on disk.
/// Some layers, such as in-memory, might not have the location.
fn path(&self) -> Option<PathBuf>;
/// Filename used to store this layer on disk. (Even in-memory layers
/// implement this, to print a handy unique identifier for the layer for
/// log messages, even though they're never not on disk.)
@@ -140,24 +138,24 @@ pub trait Layer: Send + Sync {
/// It is up to the caller to collect more data from previous layer and
/// perform WAL redo, if necessary.
///
/// Note that the 'blknum' is the offset of the page from the beginning
/// of the *relish*, not the beginning of the segment. The requested
/// 'blknum' must be covered by this segment.
/// `cached_img_lsn` should be set to a cached page image's lsn < `lsn`.
/// This function will only return data after `cached_img_lsn`.
///
/// See PageReconstructResult for possible return values. The collected data
/// is appended to reconstruct_data; the caller should pass an empty struct
/// on first call. If this returns PageReconstructResult::Continue, call
/// again on the returned predecessor layer with the same 'reconstruct_data'
/// on first call. If this returns PageReconstructResult::Continue, look up
/// the predecessor layer and call again with the same 'reconstruct_data'
/// to collect more data.
fn get_page_reconstruct_data(
&self,
blknum: u32,
blknum: SegmentBlk,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult>;
/// Return size of the segment at given LSN. (Only for blocky relations.)
fn get_seg_size(&self, lsn: Lsn) -> Result<u32>;
fn get_seg_size(&self, lsn: Lsn) -> Result<SegmentBlk>;
/// Does the segment exist at given LSN? Or was it dropped before it.
fn get_seg_exists(&self, lsn: Lsn) -> Result<bool>;
@@ -168,6 +166,9 @@ pub trait Layer: Send + Sync {
/// the previous non-incremental layer.
fn is_incremental(&self) -> bool;
/// Returns true for layers that are represented in memory.
fn is_in_memory(&self) -> bool;
/// Release memory used by this layer. There is no corresponding 'load'
/// function, that's done implicitly when you call one of the get-functions.
fn unload(&self) -> Result<()>;

View File

@@ -1,45 +1,26 @@
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use std::path::PathBuf;
use std::time::Duration;
pub mod basebackup;
pub mod branches;
pub mod config;
pub mod http;
pub mod import_datadir;
pub mod layered_repository;
pub mod page_cache;
pub mod page_service;
pub mod relish;
pub mod remote_storage;
pub mod repository;
pub mod tenant_mgr;
pub mod tenant_threads;
pub mod thread_mgr;
pub mod virtual_file;
pub mod walingest;
pub mod walreceiver;
pub mod walrecord;
pub mod walredo;
use lazy_static::lazy_static;
use zenith_metrics::{register_int_gauge_vec, IntGaugeVec};
pub mod basebackup;
pub mod branches;
pub mod http;
pub mod layered_repository;
pub mod page_service;
pub mod relish;
mod relish_storage;
pub mod repository;
pub mod restore_local_repo;
pub mod tenant_mgr;
pub mod waldecoder;
pub mod walreceiver;
pub mod walredo;
pub mod defaults {
use std::time::Duration;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = "127.0.0.1:64000"; // can't format! const yet...
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = "127.0.0.1:9898";
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
pub const DEFAULT_CHECKPOINT_PERIOD: Duration = Duration::from_secs(1);
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: Duration = Duration::from_secs(100);
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
}
use zenith_utils::zid::{ZTenantId, ZTimelineId};
lazy_static! {
static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!(
@@ -52,141 +33,13 @@ lazy_static! {
pub const LOG_FILE_NAME: &str = "pageserver.log";
#[derive(Debug, Clone)]
pub struct PageServerConf {
pub daemonize: bool,
pub listen_pg_addr: String,
pub listen_http_addr: String,
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
pub checkpoint_distance: u64,
pub checkpoint_period: Duration,
pub gc_horizon: u64,
pub gc_period: Duration,
pub superuser: String,
// Repository directory, relative to current working directory.
// Normally, the page server changes the current working directory
// to the repository, and 'workdir' is always '.'. But we don't do
// that during unit testing, because the current directory is global
// to the process but different unit tests work on different
// repositories.
pub workdir: PathBuf,
pub pg_distrib_dir: PathBuf,
pub auth_type: AuthType,
pub auth_validation_public_key_path: Option<PathBuf>,
pub relish_storage_config: Option<RelishStorageConfig>,
}
impl PageServerConf {
//
// Repository paths, relative to workdir.
//
fn tenants_path(&self) -> PathBuf {
self.workdir.join("tenants")
}
fn tenant_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenants_path().join(tenantid.to_string())
}
fn tags_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("refs").join("tags")
}
fn tag_path(&self, tag_name: &str, tenantid: &ZTenantId) -> PathBuf {
self.tags_path(tenantid).join(tag_name)
}
fn branches_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("refs").join("branches")
}
fn branch_path(&self, branch_name: &str, tenantid: &ZTenantId) -> PathBuf {
self.branches_path(tenantid).join(branch_name)
}
fn timelines_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("timelines")
}
fn timeline_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timelines_path(tenantid).join(timelineid.to_string())
}
fn ancestor_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timeline_path(timelineid, tenantid).join("ancestor")
}
fn wal_dir_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timeline_path(timelineid, tenantid).join("wal")
}
//
// Postgres distribution paths
//
pub fn pg_bin_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("bin")
}
pub fn pg_lib_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("lib")
}
#[cfg(test)]
fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{}", test_name))
}
#[cfg(test)]
fn dummy_conf(repo_dir: PathBuf) -> Self {
PageServerConf {
daemonize: false,
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: Duration::from_secs(10),
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
listen_pg_addr: "127.0.0.1:5430".to_string(),
listen_http_addr: "127.0.0.1:9898".to_string(),
superuser: "zenith_admin".to_string(),
workdir: repo_dir,
pg_distrib_dir: "".into(),
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
relish_storage_config: None,
}
}
}
/// External relish storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone)]
pub enum RelishStorageConfig {
/// Root folder to place all stored relish data into.
LocalFs(PathBuf),
AwsS3(S3Config),
}
/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write).
#[derive(Clone)]
pub struct S3Config {
pub bucket_name: String,
pub bucket_region: String,
pub access_key_id: Option<String>,
pub secret_access_key: Option<String>,
}
impl std::fmt::Debug for S3Config {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("S3Config")
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.finish()
}
/// Config for the Repository checkpointer
#[derive(Debug, Clone, Copy)]
pub enum CheckpointConfig {
// Flush in-memory data that is older than this
Distance(u64),
// Flush all in-memory data
Flush,
// Flush all in-memory data and reconstruct all page images
Forced,
}

View File

@@ -0,0 +1,778 @@
//!
//! Global page cache
//!
//! The page cache uses up most of the memory in the page server. It is shared
//! by all tenants, and it is used to store different kinds of pages. Sharing
//! the cache allows memory to be dynamically allocated where it's needed the
//! most.
//!
//! The page cache consists of fixed-size buffers, 8 kB each to match the
//! PostgreSQL buffer size, and a Slot struct for each buffer to contain
//! information about what's stored in the buffer.
//!
//! # Locking
//!
//! There are two levels of locking involved: There's one lock for the "mapping"
//! from page identifier (tenant ID, timeline ID, rel, block, LSN) to the buffer
//! slot, and a separate lock on each slot. To read or write the contents of a
//! slot, you must hold the lock on the slot in read or write mode,
//! respectively. To change the mapping of a slot, i.e. to evict a page or to
//! assign a buffer for a page, you must hold the mapping lock and the lock on
//! the slot at the same time.
//!
//! Whenever you need to hold both locks simultenously, the slot lock must be
//! acquired first. This consistent ordering avoids deadlocks. To look up a page
//! in the cache, you would first look up the mapping, while holding the mapping
//! lock, and then lock the slot. You must release the mapping lock in between,
//! to obey the lock ordering and avoid deadlock.
//!
//! A slot can momentarily have invalid contents, even if it's already been
//! inserted to the mapping, but you must hold the write-lock on the slot until
//! the contents are valid. If you need to release the lock without initializing
//! the contents, you must remove the mapping first. We make that easy for the
//! callers with PageWriteGuard: when lock_for_write() returns an uninitialized
//! page, the caller must explicitly call guard.mark_valid() after it has
//! initialized it. If the guard is dropped without calling mark_valid(), the
//! mapping is automatically removed and the slot is marked free.
//!
use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard,
},
};
use once_cell::sync::OnceCell;
use tracing::error;
use zenith_utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
use crate::layered_repository::writeback_ephemeral_file;
use crate::{config::PageServerConf, relish::RelTag};
static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
const TEST_PAGE_CACHE_SIZE: usize = 10;
///
/// Initialize the page cache. This must be called once at page server startup.
///
pub fn init(conf: &'static PageServerConf) {
if PAGE_CACHE
.set(PageCache::new(conf.page_cache_size))
.is_err()
{
panic!("page cache already initialized");
}
}
///
/// Get a handle to the page cache.
///
pub fn get() -> &'static PageCache {
//
// In unit tests, page server startup doesn't happen and no one calls
// page_cache::init(). Initialize it here with a tiny cache, so that the
// page cache is usable in unit tests.
//
if cfg!(test) {
PAGE_CACHE.get_or_init(|| PageCache::new(TEST_PAGE_CACHE_SIZE))
} else {
PAGE_CACHE.get().expect("page cache not initialized")
}
}
pub const PAGE_SZ: usize = postgres_ffi::pg_constants::BLCKSZ as usize;
const MAX_USAGE_COUNT: u8 = 5;
///
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
#[derive(Debug, PartialEq, Eq, Clone)]
enum CacheKey {
MaterializedPage {
hash_key: MaterializedPageHashKey,
lsn: Lsn,
},
EphemeralPage {
file_id: u64,
blkno: u32,
},
}
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct MaterializedPageHashKey {
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
}
#[derive(Clone)]
struct Version {
lsn: Lsn,
slot_idx: usize,
}
struct Slot {
inner: RwLock<SlotInner>,
usage_count: AtomicU8,
}
struct SlotInner {
key: Option<CacheKey>,
buf: &'static mut [u8; PAGE_SZ],
dirty: bool,
}
impl Slot {
/// Increment usage count on the buffer, with ceiling at MAX_USAGE_COUNT.
fn inc_usage_count(&self) {
let _ = self
.usage_count
.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |val| {
if val == MAX_USAGE_COUNT {
None
} else {
Some(val + 1)
}
});
}
/// Decrement usage count on the buffer, unless it's already zero. Returns
/// the old usage count.
fn dec_usage_count(&self) -> u8 {
let count_res =
self.usage_count
.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |val| {
if val == 0 {
None
} else {
Some(val - 1)
}
});
match count_res {
Ok(usage_count) => usage_count,
Err(usage_count) => usage_count,
}
}
}
pub struct PageCache {
/// This contains the mapping from the cache key to buffer slot that currently
/// contains the page, if any.
///
/// TODO: This is protected by a single lock. If that becomes a bottleneck,
/// this HashMap can be replaced with a more concurrent version, there are
/// plenty of such crates around.
///
/// If you add support for caching different kinds of objects, each object kind
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
ephemeral_page_map: RwLock<HashMap<(u64, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
/// Index of the next candidate to evict, for the Clock replacement algorithm.
/// This is interpreted modulo the page cache size.
next_evict_slot: AtomicUsize,
}
///
/// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
/// until the guard is dropped.
///
pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
impl std::ops::Deref for PageReadGuard<'_> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
self.0.buf
}
}
///
/// PageWriteGuard is a lease on a buffer for modifying it. The page is kept locked
/// until the guard is dropped.
///
/// Counterintuitively, this is used even for a read, if the requested page is not
/// currently found in the page cache. In that case, the caller of lock_for_read()
/// is expected to fill in the page contents and call mark_valid(). Similarly
/// lock_for_write() can return an invalid buffer that the caller is expected to
/// to initialize.
///
pub struct PageWriteGuard<'i> {
inner: RwLockWriteGuard<'i, SlotInner>,
// Are the page contents currently valid?
valid: bool,
}
impl std::ops::DerefMut for PageWriteGuard<'_> {
fn deref_mut(&mut self) -> &mut Self::Target {
self.inner.buf
}
}
impl std::ops::Deref for PageWriteGuard<'_> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
self.inner.buf
}
}
impl PageWriteGuard<'_> {
/// Mark that the buffer contents are now valid.
pub fn mark_valid(&mut self) {
assert!(self.inner.key.is_some());
assert!(
!self.valid,
"mark_valid called on a buffer that was already valid"
);
self.valid = true;
}
pub fn mark_dirty(&mut self) {
// only ephemeral pages can be dirty ATM.
assert!(matches!(
self.inner.key,
Some(CacheKey::EphemeralPage { .. })
));
self.inner.dirty = true;
}
}
impl Drop for PageWriteGuard<'_> {
///
/// If the buffer was allocated for a page that was not already in the
/// cache, but the lock_for_read/write() caller dropped the buffer without
/// initializing it, remove the mapping from the page cache.
///
fn drop(&mut self) {
assert!(self.inner.key.is_some());
if !self.valid {
let self_key = self.inner.key.as_ref().unwrap();
PAGE_CACHE.get().unwrap().remove_mapping(self_key);
self.inner.key = None;
self.inner.dirty = false;
}
}
}
/// lock_for_read() return value
pub enum ReadBufResult<'a> {
Found(PageReadGuard<'a>),
NotFound(PageWriteGuard<'a>),
}
/// lock_for_write() return value
pub enum WriteBufResult<'a> {
Found(PageWriteGuard<'a>),
NotFound(PageWriteGuard<'a>),
}
impl PageCache {
//
// Section 1.1: Public interface functions for looking up and memorizing materialized page
// versions in the page cache
//
/// Look up a materialized page version.
///
/// The 'lsn' is an upper bound, this will return the latest version of
/// the given block, but not newer than 'lsn'. Returns the actual LSN of the
/// returned page.
pub fn lookup_materialized_page(
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
lsn: Lsn,
) -> Option<(Lsn, PageReadGuard)> {
let mut cache_key = CacheKey::MaterializedPage {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
},
lsn,
};
if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
if let CacheKey::MaterializedPage { hash_key: _, lsn } = cache_key {
Some((lsn, guard))
} else {
panic!("unexpected key type in slot");
}
} else {
None
}
}
///
/// Store an image of the given page in the cache.
///
pub fn memorize_materialized_page(
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
lsn: Lsn,
img: &[u8],
) {
let cache_key = CacheKey::MaterializedPage {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
},
lsn,
};
match self.lock_for_write(&cache_key) {
WriteBufResult::Found(write_guard) => {
// We already had it in cache. Another thread must've put it there
// concurrently. Check that it had the same contents that we
// replayed.
assert!(*write_guard == img);
}
WriteBufResult::NotFound(mut write_guard) => {
write_guard.copy_from_slice(img);
write_guard.mark_valid();
}
}
}
// Section 1.2: Public interface functions for working with Ephemeral pages.
pub fn read_ephemeral_buf(&self, file_id: u64, blkno: u32) -> ReadBufResult {
let mut cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
pub fn write_ephemeral_buf(&self, file_id: u64, blkno: u32) -> WriteBufResult {
let cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_write(&cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_ephemeral(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::EphemeralPage { file_id, blkno: _ } if *file_id == drop_file_id => {
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
}
//
// Section 2: Internal interface functions for lookup/update.
//
// To add support for a new kind of "thing" to cache, you will need
// to add public interface routines above, and code to deal with the
// "mappings" after this section. But the routines in this section should
// not require changes.
/// Look up a page in the cache.
///
/// If the search criteria is not exact, *cache_key is updated with the key
/// for exact key of the returned page. (For materialized pages, that means
/// that the LSN in 'cache_key' is updated with the LSN of the returned page
/// version.)
///
/// If no page is found, returns None and *cache_key is left unmodified.
///
fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
let cache_key_orig = cache_key.clone();
if let Some(slot_idx) = self.search_mapping(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.read().unwrap();
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageReadGuard(inner));
} else {
// search_mapping might have modified the search key; restore it.
*cache_key = cache_key_orig;
}
}
None
}
/// Return a locked buffer for given block.
///
/// Like try_lock_for_read(), if the search criteria is not exact and the
/// page is already found in the cache, *cache_key is updated.
///
/// If the page is not found in the cache, this allocates a new buffer for
/// it. The caller may then initialize the buffer with the contents, and
/// call mark_valid().
///
/// Example usage:
///
/// ```ignore
/// let cache = page_cache::get();
///
/// match cache.lock_for_read(&key) {
/// ReadBufResult::Found(read_guard) => {
/// // The page was found in cache. Use it
/// },
/// ReadBufResult::NotFound(write_guard) => {
/// // The page was not found in cache. Read it from disk into the
/// // buffer.
/// //read_my_page_from_disk(write_guard);
///
/// // The buffer contents are now valid. Tell the page cache.
/// write_guard.mark_valid();
/// },
/// }
/// ```
///
fn lock_for_read(&self, cache_key: &mut CacheKey) -> ReadBufResult {
loop {
// First check if the key already exists in the cache.
if let Some(read_guard) = self.try_lock_for_read(cache_key) {
return ReadBufResult::Found(read_guard);
}
// Not found. Find a victim buffer
let (slot_idx, mut inner) = self.find_victim();
// Insert mapping for this. At this point, we may find that another
// thread did the same thing concurrently. In that case, we evicted
// our victim buffer unnecessarily. Put it into the free list and
// continue with the slot that the other thread chose.
if let Some(_existing_slot_idx) = self.try_insert_mapping(cache_key, slot_idx) {
// TODO: put to free list
// We now just loop back to start from beginning. This is not
// optimal, we'll perform the lookup in the mapping again, which
// is not really necessary because we already got
// 'existing_slot_idx'. But this shouldn't happen often enough
// to matter much.
continue;
}
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
return ReadBufResult::NotFound(PageWriteGuard {
inner,
valid: false,
});
}
}
/// Look up a page in the cache and lock it in write mode. If it's not
/// found, returns None.
///
/// When locking a page for writing, the search criteria is always "exact".
fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we don't released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.write().unwrap();
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageWriteGuard { inner, valid: true });
}
}
None
}
/// Return a write-locked buffer for given block.
///
/// Similar to lock_for_read(), but the returned buffer is write-locked and
/// may be modified by the caller even if it's already found in the cache.
fn lock_for_write(&self, cache_key: &CacheKey) -> WriteBufResult {
loop {
// First check if the key already exists in the cache.
if let Some(write_guard) = self.try_lock_for_write(cache_key) {
return WriteBufResult::Found(write_guard);
}
// Not found. Find a victim buffer
let (slot_idx, mut inner) = self.find_victim();
// Insert mapping for this. At this point, we may find that another
// thread did the same thing concurrently. In that case, we evicted
// our victim buffer unnecessarily. Put it into the free list and
// continue with the slot that the other thread chose.
if let Some(_existing_slot_idx) = self.try_insert_mapping(cache_key, slot_idx) {
// TODO: put to free list
// We now just loop back to start from beginning. This is not
// optimal, we'll perform the lookup in the mapping again, which
// is not really necessary because we already got
// 'existing_slot_idx'. But this shouldn't happen often enough
// to matter much.
continue;
}
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
return WriteBufResult::NotFound(PageWriteGuard {
inner,
valid: false,
});
}
}
//
// Section 3: Mapping functions
//
/// Search for a page in the cache using the given search key.
///
/// Returns the slot index, if any. If the search criteria is not exact,
/// *cache_key is updated with the actual key of the found page.
///
/// NOTE: We don't hold any lock on the mapping on return, so the slot might
/// get recycled for an unrelated page immediately after this function
/// returns. The caller is responsible for re-checking that the slot still
/// contains the page with the same key before using it.
///
fn search_mapping(&self, cache_key: &mut CacheKey) -> Option<usize> {
match cache_key {
CacheKey::MaterializedPage { hash_key, lsn } => {
let map = self.materialized_page_map.read().unwrap();
let versions = map.get(hash_key)?;
let version_idx = match versions.binary_search_by_key(lsn, |v| v.lsn) {
Ok(version_idx) => version_idx,
Err(0) => return None,
Err(version_idx) => version_idx - 1,
};
let version = &versions[version_idx];
*lsn = version.lsn;
Some(version.slot_idx)
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
/// Search for a page in the cache using the given search key.
///
/// Like 'search_mapping, but performs an "exact" search. Used for
/// allocating a new buffer.
fn search_mapping_for_write(&self, key: &CacheKey) -> Option<usize> {
match key {
CacheKey::MaterializedPage { hash_key, lsn } => {
let map = self.materialized_page_map.read().unwrap();
let versions = map.get(hash_key)?;
if let Ok(version_idx) = versions.binary_search_by_key(lsn, |v| v.lsn) {
Some(versions[version_idx].slot_idx)
} else {
None
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
///
/// Remove mapping for given key.
///
fn remove_mapping(&self, old_key: &CacheKey) {
match old_key {
CacheKey::MaterializedPage {
hash_key: old_hash_key,
lsn: old_lsn,
} => {
let mut map = self.materialized_page_map.write().unwrap();
if let Entry::Occupied(mut old_entry) = map.entry(old_hash_key.clone()) {
let versions = old_entry.get_mut();
if let Ok(version_idx) = versions.binary_search_by_key(old_lsn, |v| v.lsn) {
versions.remove(version_idx);
if versions.is_empty() {
old_entry.remove_entry();
}
}
} else {
panic!("could not find old key in mapping")
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
}
}
}
///
/// Insert mapping for given key.
///
/// If a mapping already existed for the given key, returns the slot index
/// of the existing mapping and leaves it untouched.
fn try_insert_mapping(&self, new_key: &CacheKey, slot_idx: usize) -> Option<usize> {
match new_key {
CacheKey::MaterializedPage {
hash_key: new_key,
lsn: new_lsn,
} => {
let mut map = self.materialized_page_map.write().unwrap();
let versions = map.entry(new_key.clone()).or_default();
match versions.binary_search_by_key(new_lsn, |v| v.lsn) {
Ok(version_idx) => Some(versions[version_idx].slot_idx),
Err(version_idx) => {
versions.insert(
version_idx,
Version {
lsn: *new_lsn,
slot_idx,
},
);
None
}
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
Entry::Occupied(entry) => Some(*entry.get()),
Entry::Vacant(entry) => {
entry.insert(slot_idx);
None
}
}
}
}
}
//
// Section 4: Misc internal helpers
//
/// Find a slot to evict.
///
/// On return, the slot is empty and write-locked.
fn find_victim(&self) -> (usize, RwLockWriteGuard<SlotInner>) {
let iter_limit = self.slots.len() * 2;
let mut iters = 0;
loop {
let slot_idx = self.next_evict_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
let slot = &self.slots[slot_idx];
if slot.dec_usage_count() == 0 || iters >= iter_limit {
let mut inner = slot.inner.write().unwrap();
if let Some(old_key) = &inner.key {
if inner.dirty {
if let Err(err) = Self::writeback(old_key, inner.buf) {
// Writing the page to disk failed.
//
// FIXME: What to do here, when? We could propagate the error to the
// caller, but victim buffer is generally unrelated to the original
// call. It can even belong to a different tenant. Currently, we
// report the error to the log and continue the clock sweep to find
// a different victim. But if the problem persists, the page cache
// could fill up with dirty pages that we cannot evict, and we will
// loop retrying the writebacks indefinitely.
error!("writeback of buffer {:?} failed: {}", old_key, err);
continue;
}
}
// remove mapping for old buffer
self.remove_mapping(old_key);
inner.dirty = false;
inner.key = None;
}
return (slot_idx, inner);
}
iters += 1;
}
}
fn writeback(cache_key: &CacheKey, buf: &[u8]) -> Result<(), std::io::Error> {
match cache_key {
CacheKey::MaterializedPage {
hash_key: _,
lsn: _,
} => {
panic!("unexpected dirty materialized page");
}
CacheKey::EphemeralPage { file_id, blkno } => {
writeback_ephemeral_file(*file_id, *blkno, buf)
}
}
}
/// Initialize a new page cache
///
/// This should be called only once at page server startup.
fn new(num_pages: usize) -> Self {
assert!(num_pages > 0, "page cache size must be > 0");
let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());
let slots = page_buffer
.chunks_exact_mut(PAGE_SZ)
.map(|chunk| {
let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();
Slot {
inner: RwLock::new(SlotInner {
key: None,
buf,
dirty: false,
}),
usage_count: AtomicU8::new(0),
}
})
.collect();
Self {
materialized_page_map: Default::default(),
ephemeral_page_map: Default::default(),
slots,
next_evict_slot: AtomicUsize::new(0),
}
}
}

View File

@@ -10,35 +10,35 @@
// *callmemaybe <zenith timelineid> $url* -- ask pageserver to start walreceiver on $url
//
use anyhow::{anyhow, bail, ensure, Result};
use anyhow::{bail, ensure, Context, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use lazy_static::lazy_static;
use log::*;
use regex::Regex;
use std::io;
use std::net::TcpListener;
use std::str;
use std::str::FromStr;
use std::sync::Arc;
use std::thread;
use std::{io, net::TcpStream};
use std::sync::{Arc, RwLockReadGuard};
use tracing::*;
use zenith_metrics::{register_histogram_vec, HistogramVec};
use zenith_utils::auth::{self, JwtAuth};
use zenith_utils::auth::{Claims, Scope};
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::is_socket_read_timed_out;
use zenith_utils::postgres_backend::PostgresBackend;
use zenith_utils::postgres_backend::{self, AuthType};
use zenith_utils::pq_proto::{
BeMessage, FeMessage, RowDescriptor, HELLO_WORLD_ROW, SINGLE_COL_ROWDESC,
};
use zenith_utils::pq_proto::{BeMessage, FeMessage, RowDescriptor, SINGLE_COL_ROWDESC};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::basebackup;
use crate::branches;
use crate::config::PageServerConf;
use crate::relish::*;
use crate::repository::Timeline;
use crate::tenant_mgr;
use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
use crate::walreceiver;
use crate::PageServerConf;
use crate::CheckpointConfig;
// Wrapped in libpq CopyData
enum PagestreamFeMessage {
@@ -187,26 +187,72 @@ pub fn thread_main(
listener: TcpListener,
auth_type: AuthType,
) -> anyhow::Result<()> {
loop {
let (socket, peer_addr) = listener.accept()?;
debug!("accepted connection from {}", peer_addr);
socket.set_nodelay(true).unwrap();
let local_auth = auth.clone();
thread::spawn(move || {
if let Err(err) = page_service_conn_main(conf, local_auth, socket, auth_type) {
error!("error: {}", err);
listener.set_nonblocking(true)?;
let basic_rt = tokio::runtime::Builder::new_current_thread()
.enable_io()
.build()?;
let tokio_listener = {
let _guard = basic_rt.enter();
tokio::net::TcpListener::from_std(listener)
}?;
// Wait for a new connection to arrive, or for server shutdown.
while let Some(res) = basic_rt.block_on(async {
let shutdown_watcher = thread_mgr::shutdown_watcher();
tokio::select! {
biased;
_ = shutdown_watcher => {
// We were requested to shut down.
None
}
});
res = tokio_listener.accept() => {
Some(res)
}
}
}) {
match res {
Ok((socket, peer_addr)) => {
// Connection established. Spawn a new thread to handle it.
debug!("accepted connection from {}", peer_addr);
let local_auth = auth.clone();
// PageRequestHandler threads are not associated with any particular
// timeline in the thread manager. In practice most connections will
// only deal with a particular timeline, but we don't know which one
// yet.
if let Err(err) = thread_mgr::spawn(
ThreadKind::PageRequestHandler,
None,
None,
"serving Page Service thread",
move || page_service_conn_main(conf, local_auth, socket, auth_type),
) {
// Thread creation failed. Log the error and continue.
error!("could not spawn page service thread: {:?}", err);
}
}
Err(err) => {
// accept() failed. Log the error, and loop back to retry on next connection.
error!("accept() failed: {:?}", err);
}
}
}
debug!("page_service loop terminated");
Ok(())
}
fn page_service_conn_main(
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
socket: TcpStream,
socket: tokio::net::TcpStream,
auth_type: AuthType,
) -> anyhow::Result<()> {
// Immediatsely increment the gauge, then create a job to decrement it on thread exit.
// Immediately increment the gauge, then create a job to decrement it on thread exit.
// One of the pros of `defer!` is that this will *most probably*
// get called, even in presence of panics.
let gauge = crate::LIVE_CONNECTIONS_COUNT.with_label_values(&["page_service"]);
@@ -215,8 +261,21 @@ fn page_service_conn_main(
gauge.dec();
}
// We use Tokio to accept the connection, but the rest of the code works with a
// regular socket. Convert.
let socket = socket
.into_std()
.context("could not convert tokio::net:TcpStream to std::net::TcpStream")?;
socket
.set_nonblocking(false)
.context("could not put socket to blocking mode")?;
socket
.set_nodelay(true)
.context("could not set TCP_NODELAY")?;
let mut conn_handler = PageServerHandler::new(conf, auth);
let pgbackend = PostgresBackend::new(socket, auth_type, None)?;
let pgbackend = PostgresBackend::new(socket, auth_type, None, true)?;
pgbackend.run(&mut conn_handler)
}
@@ -260,48 +319,67 @@ impl PageServerHandler {
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> anyhow::Result<()> {
let _enter = info_span!("pagestream", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Cannot handle pagerequests for a remote timeline")?;
/* switch client to COPYBOTH */
pgb.write_message(&BeMessage::CopyBothResponse)?;
while let Some(message) = pgb.read_message()? {
trace!("query({:?}): {:?}", timelineid, message);
while !thread_mgr::is_shutdown_requested() {
match pgb.read_message() {
Ok(message) => {
if let Some(message) = message {
trace!("query: {:?}", message);
let copy_data_bytes = match message {
FeMessage::CopyData(bytes) => bytes,
_ => continue,
};
let copy_data_bytes = match message {
FeMessage::CopyData(bytes) => bytes,
_ => continue,
};
let zenith_fe_msg = PagestreamFeMessage::parse(copy_data_bytes)?;
let zenith_fe_msg = PagestreamFeMessage::parse(copy_data_bytes)?;
let response = match zenith_fe_msg {
PagestreamFeMessage::Exists(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_exists"])
.observe_closure_duration(|| {
self.handle_get_rel_exists_request(&*timeline, &req)
}),
PagestreamFeMessage::Nblocks(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_size"])
.observe_closure_duration(|| self.handle_get_nblocks_request(&*timeline, &req)),
PagestreamFeMessage::GetPage(req) => SMGR_QUERY_TIME
.with_label_values(&["get_page_at_lsn"])
.observe_closure_duration(|| {
self.handle_get_page_at_lsn_request(&*timeline, &req)
}),
};
let response = match zenith_fe_msg {
PagestreamFeMessage::Exists(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_exists"])
.observe_closure_duration(|| {
self.handle_get_rel_exists_request(timeline.as_ref(), &req)
}),
PagestreamFeMessage::Nblocks(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_size"])
.observe_closure_duration(|| {
self.handle_get_nblocks_request(timeline.as_ref(), &req)
}),
PagestreamFeMessage::GetPage(req) => SMGR_QUERY_TIME
.with_label_values(&["get_page_at_lsn"])
.observe_closure_duration(|| {
self.handle_get_page_at_lsn_request(timeline.as_ref(), &req)
}),
};
let response = response.unwrap_or_else(|e| {
error!("error reading relation or page version: {}", e);
PagestreamBeMessage::Error(PagestreamErrorResponse {
message: e.to_string(),
})
});
let response = response.unwrap_or_else(|e| {
// print the all details to the log with {:#}, but for the client the
// error message is enough
error!("error reading relation or page version: {:?}", e);
PagestreamBeMessage::Error(PagestreamErrorResponse {
message: e.to_string(),
})
});
pgb.write_message(&BeMessage::CopyData(&response.serialize()))?;
pgb.write_message(&BeMessage::CopyData(&response.serialize()))?;
} else {
break;
}
}
Err(e) => {
if !is_socket_read_timed_out(&e) {
return Err(e);
}
}
}
}
Ok(())
}
@@ -317,7 +395,12 @@ impl PageServerHandler {
/// In either case, if the page server hasn't received the WAL up to the
/// requested LSN yet, we will wait for it to arrive. The return value is
/// the LSN that should be used to look up the page versions.
fn wait_or_get_last_lsn(timeline: &dyn Timeline, lsn: Lsn, latest: bool) -> Result<Lsn> {
fn wait_or_get_last_lsn(
timeline: &dyn Timeline,
mut lsn: Lsn,
latest: bool,
latest_gc_cutoff_lsn: &RwLockReadGuard<Lsn>,
) -> Result<Lsn> {
if latest {
// Latest page version was requested. If LSN is given, it is a hint
// to the page server that there have been no modifications to the
@@ -338,22 +421,26 @@ impl PageServerHandler {
// walsender completes the authentication and starts streaming the
// WAL.
if lsn <= last_record_lsn {
Ok(last_record_lsn)
lsn = last_record_lsn;
} else {
timeline.wait_lsn(lsn)?;
// Since we waited for 'lsn' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the
// last-record LSN can advance immediately after we return
// anyway)
Ok(lsn)
}
} else {
if lsn == Lsn(0) {
bail!("invalid LSN(0) in request");
}
timeline.wait_lsn(lsn)?;
Ok(lsn)
}
ensure!(
lsn >= **latest_gc_cutoff_lsn,
"tried to request a page version that was garbage collected. requested at {} gc cutoff {}",
lsn, **latest_gc_cutoff_lsn
);
Ok(lsn)
}
fn handle_get_rel_exists_request(
@@ -361,8 +448,11 @@ impl PageServerHandler {
timeline: &dyn Timeline,
req: &PagestreamExistsRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_rel_exists", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
let exists = timeline.get_rel_exists(tag, lsn)?;
@@ -376,8 +466,10 @@ impl PageServerHandler {
timeline: &dyn Timeline,
req: &PagestreamNblocksRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_nblocks", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
let n_blocks = timeline.get_relish_size(tag, lsn)?;
@@ -395,9 +487,19 @@ impl PageServerHandler {
timeline: &dyn Timeline,
req: &PagestreamGetPageRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_page", rel = %req.rel, blkno = &req.blkno, req_lsn = %req.lsn)
.entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
/*
// Add a 1s delay to some requests. The delayed causes the requests to
// hit the race condition from github issue #1047 more easily.
use rand::Rng;
if rand::thread_rng().gen::<u8>() < 5 {
std::thread::sleep(std::time::Duration::from_millis(1000));
}
*/
let page = timeline.get_page_at_lsn(tag, req.blkno, lsn)?;
Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
@@ -412,17 +514,27 @@ impl PageServerHandler {
lsn: Option<Lsn>,
tenantid: ZTenantId,
) -> anyhow::Result<()> {
// check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
let span = info_span!("basebackup", timeline = %timelineid, tenant = %tenantid, lsn = field::Empty);
let _enter = span.enter();
/* switch client to COPYOUT */
// check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Cannot handle basebackup request for a remote timeline")?;
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
if let Some(lsn) = lsn {
timeline
.check_lsn_is_in_scope(lsn, &latest_gc_cutoff_lsn)
.context("invalid basebackup lsn")?;
}
// switch client to COPYOUT
pgb.write_message(&BeMessage::CopyOutResponse)?;
info!("sent CopyOut");
/* Send a tarball of the latest layer on the timeline */
{
let mut writer = CopyDataSink { pgb };
let mut basebackup = basebackup::Basebackup::new(&mut writer, &timeline, lsn)?;
span.record("lsn", &basebackup.lsn.to_string().as_str());
basebackup.send_tarball()?;
}
pgb.write_message(&BeMessage::CopyDone)?;
@@ -483,17 +595,10 @@ impl postgres_backend::Handler for PageServerHandler {
fn process_query(
&mut self,
pgb: &mut PostgresBackend,
query_string: Bytes,
query_string: &str,
) -> anyhow::Result<()> {
debug!("process query {:?}", query_string);
// remove null terminator, if any
let mut query_string = query_string;
if query_string.last() == Some(&0) {
query_string.truncate(query_string.len() - 1);
}
let query_string = std::str::from_utf8(&query_string)?;
if query_string.starts_with("pagestream ") {
let (_, params_raw) = query_string.split_at("pagestream ".len());
let params = params_raw.split(' ').collect::<Vec<_>>();
@@ -527,11 +632,6 @@ impl postgres_backend::Handler for PageServerHandler {
None
};
info!(
"got basebackup command. tenantid=\"{}\" timelineid=\"{}\" lsn=\"{:#?}\"",
tenantid, timelineid, lsn
);
// Check that the timeline exists
self.handle_basebackup_request(pgb, timelineid, lsn, tenantid)?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
@@ -541,7 +641,7 @@ impl postgres_backend::Handler for PageServerHandler {
let re = Regex::new(r"^callmemaybe ([[:xdigit:]]+) ([[:xdigit:]]+) (.*)$").unwrap();
let caps = re
.captures(query_string)
.ok_or_else(|| anyhow!("invalid callmemaybe: '{}'", query_string))?;
.with_context(|| format!("invalid callmemaybe: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
@@ -549,80 +649,31 @@ impl postgres_backend::Handler for PageServerHandler {
self.check_permission(Some(tenantid))?;
// Check that the timeline exists
tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
let _enter =
info_span!("callmemaybe", timeline = %timelineid, tenant = %tenantid).entered();
walreceiver::launch_wal_receiver(self.conf, timelineid, &connstr, tenantid.to_owned());
// Check that the timeline exists
tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Failed to fetch local timeline for callmemaybe requests")?;
walreceiver::launch_wal_receiver(self.conf, tenantid, timelineid, &connstr)?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("branch_create ") {
let err = || anyhow!("invalid branch_create: '{}'", query_string);
// branch_create <tenantid> <branchname> <startpoint>
// TODO lazy static
// TODO: escaping, to allow branch names with spaces
let re = Regex::new(r"^branch_create ([[:xdigit:]]+) (\S+) ([^\r\n\s;]+)[\r\n\s;]*;?$")
.unwrap();
let caps = re.captures(query_string).ok_or_else(err)?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let branchname = caps.get(2).ok_or_else(err)?.as_str().to_owned();
let startpoint_str = caps.get(3).ok_or_else(err)?.as_str().to_owned();
self.check_permission(Some(tenantid))?;
let branch =
branches::create_branch(self.conf, &branchname, &startpoint_str, &tenantid)?;
let branch = serde_json::to_vec(&branch)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(&branch)]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("branch_list ") {
// branch_list <zenith tenantid as hex string>
let re = Regex::new(r"^branch_list ([[:xdigit:]]+)$").unwrap();
let caps = re
.captures(query_string)
.ok_or_else(|| anyhow!("invalid branch_list: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let branches = crate::branches::get_branches(self.conf, &tenantid)?;
let branches_buf = serde_json::to_vec(&branches)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(&branches_buf)]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("tenant_list") {
let tenants = crate::branches::get_tenants(self.conf)?;
let tenants_buf = serde_json::to_vec(&tenants)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(&tenants_buf)]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("tenant_create") {
let err = || anyhow!("invalid tenant_create: '{}'", query_string);
// tenant_create <tenantid>
let re = Regex::new(r"^tenant_create ([[:xdigit:]]+)$").unwrap();
let caps = re.captures(query_string).ok_or_else(err)?;
self.check_permission(None)?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
tenant_mgr::create_repository_for_tenant(self.conf, tenantid)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("status") {
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&HELLO_WORLD_ROW)?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.to_ascii_lowercase().starts_with("set ") {
// important because psycopg2 executes "SET datestyle TO 'ISO'"
// on connect
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("failpoints ") {
let (_, failpoints) = query_string.split_at("failpoints ".len());
for failpoint in failpoints.split(';') {
if let Some((name, actions)) = failpoint.split_once('=') {
info!("cfg failpoint: {} {}", name, actions);
fail::cfg(name, actions).unwrap();
} else {
bail!("Invalid failpoints format");
}
}
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("do_gc ") {
// Run GC immediately on given timeline.
// FIXME: This is just for tests. See test_runner/batch_others/test_gc.py.
@@ -636,7 +687,7 @@ impl postgres_backend::Handler for PageServerHandler {
let caps = re
.captures(query_string)
.ok_or_else(|| anyhow!("invalid do_gc: '{}'", query_string))?;
.with_context(|| format!("invalid do_gc: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
@@ -646,20 +697,20 @@ impl postgres_backend::Handler for PageServerHandler {
.unwrap_or(Ok(self.conf.gc_horizon))?;
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"layer_relfiles_total"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_relfiles_not_updated"),
RowDescriptor::int8_col(b"layer_relfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_relfiles_removed"),
RowDescriptor::int8_col(b"layer_relfiles_dropped"),
RowDescriptor::int8_col(b"layer_nonrelfiles_total"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_nonrelfiles_removed"),
RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"),
RowDescriptor::int8_col(b"elapsed"),
@@ -679,6 +730,12 @@ impl postgres_backend::Handler for PageServerHandler {
.as_bytes(),
),
Some(result.ondisk_relfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_relfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_relfiles_removed.to_string().as_bytes()),
Some(result.ondisk_relfiles_dropped.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_total.to_string().as_bytes()),
@@ -695,11 +752,36 @@ impl postgres_backend::Handler for PageServerHandler {
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_nonrelfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_removed.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_dropped.to_string().as_bytes()),
Some(result.elapsed.as_millis().to_string().as_bytes()),
]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("checkpoint ") {
// Run checkpoint immediately on given timeline.
// checkpoint <tenant_id> <timeline_id>
let re = Regex::new(r"^checkpoint ([[:xdigit:]]+)\s([[:xdigit:]]+)($|\s)?").unwrap();
let caps = re
.captures(query_string)
.with_context(|| format!("invalid checkpoint command: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Failed to fetch local timeline for checkpoint request")?;
timeline.checkpoint(CheckpointConfig::Forced)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else {
bail!("unknown command");
}

View File

@@ -224,8 +224,3 @@ impl SlruKind {
}
}
}
pub const FIRST_NONREL_RELISH_TAG: RelishTag = RelishTag::Slru {
slru: SlruKind::Clog,
segno: 0,
};

View File

@@ -1,54 +0,0 @@
//! Abstractions for the page server to store its relish layer data in the external storage.
//!
//! Main purpose of this module subtree is to provide a set of abstractions to manage the storage state
//! in a way, optimal for page server.
//!
//! The abstractions hide multiple custom external storage API implementations,
//! such as AWS S3, local filesystem, etc., located in the submodules.
mod local_fs;
mod rust_s3;
/// A queue and the background machinery behind it to upload
/// local page server layer files to external storage.
pub mod storage_uploader;
use std::path::Path;
use anyhow::Context;
/// Storage (potentially remote) API to manage its state.
#[async_trait::async_trait]
pub trait RelishStorage: Send + Sync {
type RelishStoragePath;
fn derive_destination(
page_server_workdir: &Path,
relish_local_path: &Path,
) -> anyhow::Result<Self::RelishStoragePath>;
async fn list_relishes(&self) -> anyhow::Result<Vec<Self::RelishStoragePath>>;
async fn download_relish(
&self,
from: &Self::RelishStoragePath,
to: &Path,
) -> anyhow::Result<()>;
async fn delete_relish(&self, path: &Self::RelishStoragePath) -> anyhow::Result<()>;
async fn upload_relish(&self, from: &Path, to: &Self::RelishStoragePath) -> anyhow::Result<()>;
}
fn strip_workspace_prefix<'a>(
page_server_workdir: &'a Path,
relish_local_path: &'a Path,
) -> anyhow::Result<&'a Path> {
relish_local_path
.strip_prefix(page_server_workdir)
.with_context(|| {
format!(
"Unexpected: relish local path '{}' is not relevant to server workdir",
relish_local_path.display(),
)
})
}

View File

@@ -1,158 +0,0 @@
//! Local filesystem relish storage.
//!
//! Page server already stores layer data on the server, when freezing it.
//! This storage serves a way to
//!
//! * test things locally simply
//! * allow to compabre both binary sets
//! * help validating the relish storage API
use std::{
future::Future,
path::{Path, PathBuf},
pin::Pin,
};
use anyhow::{bail, Context};
use super::{strip_workspace_prefix, RelishStorage};
pub struct LocalFs {
root: PathBuf,
}
impl LocalFs {
/// Atetmpts to create local FS relish storage, also creates the directory provided, if not exists.
pub fn new(root: PathBuf) -> anyhow::Result<Self> {
if !root.exists() {
std::fs::create_dir_all(&root).with_context(|| {
format!(
"Failed to create all directories in the given root path {}",
root.display(),
)
})?;
}
Ok(Self { root })
}
fn resolve_in_storage(&self, path: &Path) -> anyhow::Result<PathBuf> {
if path.is_relative() {
Ok(self.root.join(path))
} else if path.starts_with(&self.root) {
Ok(path.to_path_buf())
} else {
bail!(
"Path '{}' does not belong to the current storage",
path.display()
)
}
}
}
#[async_trait::async_trait]
impl RelishStorage for LocalFs {
type RelishStoragePath = PathBuf;
fn derive_destination(
page_server_workdir: &Path,
relish_local_path: &Path,
) -> anyhow::Result<Self::RelishStoragePath> {
Ok(strip_workspace_prefix(page_server_workdir, relish_local_path)?.to_path_buf())
}
async fn list_relishes(&self) -> anyhow::Result<Vec<Self::RelishStoragePath>> {
Ok(get_all_files(&self.root).await?.into_iter().collect())
}
async fn download_relish(
&self,
from: &Self::RelishStoragePath,
to: &Path,
) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
create_target_directory(to).await?;
tokio::fs::copy(file_path, to).await?;
Ok(())
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn delete_relish(&self, path: &Self::RelishStoragePath) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(path)?;
if file_path.exists() && file_path.is_file() {
Ok(tokio::fs::remove_file(file_path).await?)
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn upload_relish(&self, from: &Path, to: &Self::RelishStoragePath) -> anyhow::Result<()> {
let target_file_path = self.resolve_in_storage(to)?;
create_target_directory(&target_file_path).await?;
tokio::fs::copy(&from, &target_file_path)
.await
.with_context(|| {
format!(
"Failed to upload relish '{}' to local storage",
from.display(),
)
})?;
Ok(())
}
}
fn get_all_files<'a, P>(
directory_path: P,
) -> Pin<Box<dyn Future<Output = anyhow::Result<Vec<PathBuf>>> + Send + Sync + 'a>>
where
P: AsRef<Path> + Send + Sync + 'a,
{
Box::pin(async move {
let directory_path = directory_path.as_ref();
if directory_path.exists() {
if directory_path.is_dir() {
let mut paths = Vec::new();
let mut dir_contents = tokio::fs::read_dir(directory_path).await?;
while let Some(dir_entry) = dir_contents.next_entry().await? {
let file_type = dir_entry.file_type().await?;
let entry_path = dir_entry.path();
if file_type.is_symlink() {
log::debug!("{:?} us a symlink, skipping", entry_path)
} else if file_type.is_dir() {
paths.extend(get_all_files(entry_path).await?.into_iter())
} else {
paths.push(dir_entry.path());
}
}
Ok(paths)
} else {
bail!("Path '{}' is not a directory", directory_path.display())
}
} else {
Ok(Vec::new())
}
})
}
async fn create_target_directory(target_file_path: &Path) -> anyhow::Result<()> {
let target_dir = match target_file_path.parent() {
Some(parent_dir) => parent_dir,
None => bail!(
"Relish path '{}' has no parent directory",
target_file_path.display()
),
};
if !target_dir.exists() {
tokio::fs::create_dir_all(target_dir).await?;
}
Ok(())
}

View File

@@ -1,144 +0,0 @@
//! A wrapper around AWS S3 client library `rust_s3` to be used a relish storage.
use std::path::Path;
use anyhow::Context;
use s3::{bucket::Bucket, creds::Credentials, region::Region};
use crate::{relish_storage::strip_workspace_prefix, S3Config};
use super::RelishStorage;
const S3_FILE_SEPARATOR: char = '/';
#[derive(Debug)]
pub struct S3ObjectKey(String);
impl S3ObjectKey {
fn key(&self) -> &str {
&self.0
}
}
/// AWS S3 relish storage.
pub struct RustS3 {
bucket: Bucket,
}
impl RustS3 {
/// Creates the relish storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config) -> anyhow::Result<Self> {
let region = aws_config
.bucket_region
.parse::<Region>()
.context("Failed to parse the s3 region from config")?;
let credentials = Credentials::new(
aws_config.access_key_id.as_deref(),
aws_config.secret_access_key.as_deref(),
None,
None,
None,
)
.context("Failed to create the s3 credentials")?;
Ok(Self {
bucket: Bucket::new_with_path_style(
aws_config.bucket_name.as_str(),
region,
credentials,
)
.context("Failed to create the s3 bucket")?,
})
}
}
#[async_trait::async_trait]
impl RelishStorage for RustS3 {
type RelishStoragePath = S3ObjectKey;
fn derive_destination(
page_server_workdir: &Path,
relish_local_path: &Path,
) -> anyhow::Result<Self::RelishStoragePath> {
let relative_path = strip_workspace_prefix(page_server_workdir, relish_local_path)?;
let mut key = String::new();
for segment in relative_path {
key.push(S3_FILE_SEPARATOR);
key.push_str(&segment.to_string_lossy());
}
Ok(S3ObjectKey(key))
}
async fn list_relishes(&self) -> anyhow::Result<Vec<Self::RelishStoragePath>> {
let list_response = self
.bucket
.list(String::new(), None)
.await
.context("Failed to list s3 objects")?;
Ok(list_response
.into_iter()
.flat_map(|response| response.contents)
.map(|s3_object| S3ObjectKey(s3_object.key))
.collect())
}
async fn download_relish(
&self,
from: &Self::RelishStoragePath,
to: &Path,
) -> anyhow::Result<()> {
let mut target_file = std::fs::OpenOptions::new()
.write(true)
.open(to)
.with_context(|| format!("Failed to open target s3 destination at {}", to.display()))?;
let code = self
.bucket
.get_object_stream(from.key(), &mut target_file)
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during downloading object from directory, code: {}",
code
))
} else {
Ok(())
}
}
async fn delete_relish(&self, path: &Self::RelishStoragePath) -> anyhow::Result<()> {
let (_, code) = self
.bucket
.delete_object(path.key())
.await
.with_context(|| format!("Failed to delete s3 object with key {}", path.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during deleting object with key '{}', code: {}",
path.key(),
code
))
} else {
Ok(())
}
}
async fn upload_relish(&self, from: &Path, to: &Self::RelishStoragePath) -> anyhow::Result<()> {
let mut local_file = tokio::fs::OpenOptions::new().read(true).open(from).await?;
let code = self
.bucket
.put_object_stream(&mut local_file, to.key())
.await
.with_context(|| format!("Failed to create s3 object with key {}", to.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during creating object with key '{}', code: {}",
to.key(),
code
))
} else {
Ok(())
}
}
}

View File

@@ -1,116 +0,0 @@
use std::{
collections::VecDeque,
path::{Path, PathBuf},
sync::{Arc, Mutex},
thread,
};
use zenith_utils::zid::ZTimelineId;
use crate::{relish_storage::RelishStorage, RelishStorageConfig};
use super::{local_fs::LocalFs, rust_s3::RustS3};
pub struct QueueBasedRelishUploader {
upload_queue: Arc<Mutex<VecDeque<(ZTimelineId, PathBuf)>>>,
}
impl QueueBasedRelishUploader {
pub fn new(
config: &RelishStorageConfig,
page_server_workdir: &'static Path,
) -> anyhow::Result<Self> {
let upload_queue = Arc::new(Mutex::new(VecDeque::new()));
let _handle = match config {
RelishStorageConfig::LocalFs(root) => {
let relish_storage = LocalFs::new(root.clone())?;
create_upload_thread(
Arc::clone(&upload_queue),
relish_storage,
page_server_workdir,
)?
}
RelishStorageConfig::AwsS3(s3_config) => {
let relish_storage = RustS3::new(s3_config)?;
create_upload_thread(
Arc::clone(&upload_queue),
relish_storage,
page_server_workdir,
)?
}
};
Ok(Self { upload_queue })
}
pub fn schedule_upload(&self, timeline_id: ZTimelineId, relish_path: PathBuf) {
self.upload_queue
.lock()
.unwrap()
.push_back((timeline_id, relish_path))
}
}
fn create_upload_thread<P, S: 'static + RelishStorage<RelishStoragePath = P>>(
upload_queue: Arc<Mutex<VecDeque<(ZTimelineId, PathBuf)>>>,
relish_storage: S,
page_server_workdir: &'static Path,
) -> std::io::Result<thread::JoinHandle<()>> {
let runtime = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()?;
thread::Builder::new()
.name("Queue based relish uploader".to_string())
.spawn(move || loop {
runtime.block_on(async {
upload_loop_step(&upload_queue, &relish_storage, page_server_workdir).await;
})
})
}
async fn upload_loop_step<P, S: 'static + RelishStorage<RelishStoragePath = P>>(
upload_queue: &Mutex<VecDeque<(ZTimelineId, PathBuf)>>,
relish_storage: &S,
page_server_workdir: &Path,
) {
let mut queue_accessor = upload_queue.lock().unwrap();
log::debug!("current upload queue length: {}", queue_accessor.len());
let next_upload = queue_accessor.pop_front();
drop(queue_accessor);
let (relish_timeline_id, relish_local_path) = match next_upload {
Some(data) => data,
None => {
// Don't spin and allow others to use the queue.
// In future, could be improved to be more clever about delays depending on relish upload stats
thread::sleep(std::time::Duration::from_secs(1));
return;
}
};
if let Err(e) = upload_relish(relish_storage, page_server_workdir, &relish_local_path).await {
log::error!(
"Failed to upload relish '{}' for timeline {}, reason: {}",
relish_local_path.display(),
relish_timeline_id,
e
);
upload_queue
.lock()
.unwrap()
.push_back((relish_timeline_id, relish_local_path))
} else {
log::debug!("Relish successfully uploaded");
}
}
async fn upload_relish<P, S: RelishStorage<RelishStoragePath = P>>(
relish_storage: &S,
page_server_workdir: &Path,
relish_local_path: &Path,
) -> anyhow::Result<()> {
let destination = S::derive_destination(page_server_workdir, relish_local_path)?;
relish_storage
.upload_relish(relish_local_path, &destination)
.await
}

View File

@@ -0,0 +1,358 @@
//! A set of generic storage abstractions for the page server to use when backing up and restoring its state from the external storage.
//! This particular module serves as a public API border between pageserver and the internal storage machinery.
//! No other modules from this tree are supposed to be used directly by the external code.
//!
//! There are a few components the storage machinery consists of:
//! * [`RemoteStorage`] trait a CRUD-like generic abstraction to use for adapting external storages with a few implementations:
//! * [`local_fs`] allows to use local file system as an external storage
//! * [`rust_s3`] uses AWS S3 bucket as an external storage
//!
//! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync.
//! Synchronization internals are split into submodules
//! * [`storage_sync::compression`] for a custom remote storage format used to store timeline files in archives
//! * [`storage_sync::index`] to keep track of remote tenant files, the metadata and their mappings to local files
//! * [`storage_sync::upload`] and [`storage_sync::download`] to manage archive creation and upload; download and extraction, respectively
//!
//! * public API via to interact with the external world:
//! * [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization
//! * [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] to enqueue a new upload and download tasks,
//! to be processed by the async loop
//!
//! Here's a schematic overview of all interactions backup and the rest of the pageserver perform:
//!
//! +------------------------+ +--------->-------+
//! | | - - - (init async loop) - - - -> | |
//! | | | |
//! | | -------------------------------> | async |
//! | pageserver | (enqueue timeline sync task) | upload/download |
//! | | | loop |
//! | | <------------------------------- | |
//! | | (apply new timeline sync states) | |
//! +------------------------+ +---------<-------+
//! |
//! |
//! CRUD layer file operations |
//! (upload/download/delete/list, etc.) |
//! V
//! +------------------------+
//! | |
//! | [`RemoteStorage`] impl |
//! | |
//! | pageserver assumes it |
//! | owns exclusive write |
//! | access to this storage |
//! +------------------------+
//!
//! First, during startup, the pageserver inits the storage sync thread with the async loop, or leaves the loop uninitialised, if configured so.
//! The loop inits the storage connection and checks the remote files stored.
//! This is done once at startup only, relying on the fact that pageserver uses the storage alone (ergo, nobody else uploads the files to the storage but this server).
//! Based on the remote storage data, the sync logic immediately schedules sync tasks for local timelines and reports about remote only timelines to pageserver, so it can
//! query their downloads later if they are accessed.
//!
//! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata.
//! If the storage sync loop was successfully started before, pageserver schedules the new checkpoint file uploads after every checkpoint.
//! The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either).
//! See [`crate::layered_repository`] for the upload calls and the adjacent logic.
//!
//! Synchronization logic is able to communicate back with updated timeline sync states, [`TimelineSyncState`],
//! submitted via [`crate::tenant_mgr::set_timeline_states`] function. Tenant manager applies corresponding timeline updates in pageserver's in-memory state.
//! Such submissions happen in two cases:
//! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future
//! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory
//!
//! When the pageserver terminates, the upload loop finishes a current sync task (if any) and exits.
//!
//! The storage logic considers `image` as a set of local files, fully representing a certain timeline at given moment (identified with `disk_consistent_lsn`).
//! Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed
//! by the storage upload, if enabled.
//! Yet timeline cannot alter already existing files, and normally cannot remote those too: only a GC process is capable of removing unused files.
//! This way, remote storage synchronization relies on the fact that every checkpoint is incremental and local files are "immutable":
//! * when a certain checkpoint gets uploaded, the sync loop remembers the fact, preventing further reuploads of the same state
//! * no files are deleted from either local or remote storage, only the missing ones locally/remotely get downloaded/uploaded, local metadata file will be overwritten
//! when the newer image is downloaded
//!
//! To optimize S3 storage (and access), the sync loop compresses the checkpoint files before placing them to S3, and uncompresses them back, keeping track of timeline files and metadata.
//! Also, the remote file list is queried once only, at startup, to avoid possible extra costs and latency issues.
//!
//! NOTES:
//! * pageserver assumes it has exclusive write access to the remote storage. If supported, the way multiple pageservers can be separated in the same storage
//! (i.e. using different directories in the local filesystem external storage), but totally up to the storage implementation and not covered with the trait API.
//!
//! * the sync tasks may not processed immediately after the submission: if they error and get re-enqueued, their execution might be backed off to ensure error cap is not exceeded too fast.
//! The sync queue processing also happens in batches, so the sync tasks can wait in the queue for some time.
mod local_fs;
mod rust_s3;
mod storage_sync;
use std::{
collections::HashMap,
ffi, fs,
path::{Path, PathBuf},
};
use anyhow::{bail, Context};
use tokio::io;
use tracing::{error, info};
use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download};
use self::{local_fs::LocalFs, rust_s3::S3};
use crate::{
config::{PageServerConf, RemoteStorageKind},
layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME},
repository::TimelineSyncState,
};
pub use storage_sync::compression;
/// A structure to combine all synchronization data to share with pageserver after a successful sync loop initialization.
/// Successful initialization includes a case when sync loop is not started, in which case the startup data is returned still,
/// to simplify the received code.
pub struct SyncStartupData {
/// A sync state, derived from initial comparison of local timeline files and the remote archives,
/// before any sync tasks are executed.
/// To reuse the local file scan logic, the timeline states are returned even if no sync loop get started during init:
/// in this case, no remote files exist and all local timelines with correct metadata files are considered ready.
pub initial_timeline_states: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>>,
}
/// Based on the config, initiates the remote storage connection and starts a separate thread
/// that ensures that pageserver and the remote storage are in sync with each other.
/// If no external configuration connection given, no thread or storage initialization is done.
/// Along with that, scans tenant files local and remote (if the sync gets enabled) to check the initial timeline states.
pub fn start_local_timeline_sync(
config: &'static PageServerConf,
) -> anyhow::Result<SyncStartupData> {
let local_timeline_files = local_tenant_timeline_files(config)
.context("Failed to collect local tenant timeline files")?;
match &config.remote_storage_config {
Some(storage_config) => match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => {
info!("Using fs root '{}' as a remote storage", root.display());
storage_sync::spawn_storage_sync_thread(
config,
local_timeline_files,
LocalFs::new(root.clone(), &config.workdir)?,
storage_config.max_concurrent_sync,
storage_config.max_sync_errors,
)
},
RemoteStorageKind::AwsS3(s3_config) => {
info!("Using s3 bucket '{}' in region '{}' as a remote storage, prefix in bucket: '{:?}', bucket endpoint: '{:?}'",
s3_config.bucket_name, s3_config.bucket_region, s3_config.prefix_in_bucket, s3_config.endpoint);
storage_sync::spawn_storage_sync_thread(
config,
local_timeline_files,
S3::new(s3_config, &config.workdir)?,
storage_config.max_concurrent_sync,
storage_config.max_sync_errors,
)
},
}
.context("Failed to spawn the storage sync thread"),
None => {
info!("No remote storage configured, skipping storage sync, considering all local timelines with correct metadata files enabled");
let mut initial_timeline_states: HashMap<
ZTenantId,
HashMap<ZTimelineId, TimelineSyncState>,
> = HashMap::new();
for (ZTenantTimelineId{tenant_id, timeline_id}, (timeline_metadata, _)) in
local_timeline_files
{
initial_timeline_states
.entry(tenant_id)
.or_default()
.insert(
timeline_id,
TimelineSyncState::Ready(timeline_metadata.disk_consistent_lsn()),
);
}
Ok(SyncStartupData {
initial_timeline_states,
})
}
}
}
fn local_tenant_timeline_files(
config: &'static PageServerConf,
) -> anyhow::Result<HashMap<ZTenantTimelineId, (TimelineMetadata, Vec<PathBuf>)>> {
let mut local_tenant_timeline_files = HashMap::new();
let tenants_dir = config.tenants_path();
for tenants_dir_entry in fs::read_dir(&tenants_dir)
.with_context(|| format!("Failed to list tenants dir {}", tenants_dir.display()))?
{
match &tenants_dir_entry {
Ok(tenants_dir_entry) => {
match collect_timelines_for_tenant(config, &tenants_dir_entry.path()) {
Ok(collected_files) => {
local_tenant_timeline_files.extend(collected_files.into_iter())
}
Err(e) => error!(
"Failed to collect tenant files from dir '{}' for entry {:?}, reason: {:#}",
tenants_dir.display(),
tenants_dir_entry,
e
),
}
}
Err(e) => error!(
"Failed to list tenants dir entry {:?} in directory {}, reason: {:?}",
tenants_dir_entry,
tenants_dir.display(),
e
),
}
}
Ok(local_tenant_timeline_files)
}
fn collect_timelines_for_tenant(
config: &'static PageServerConf,
tenant_path: &Path,
) -> anyhow::Result<HashMap<ZTenantTimelineId, (TimelineMetadata, Vec<PathBuf>)>> {
let mut timelines: HashMap<ZTenantTimelineId, (TimelineMetadata, Vec<PathBuf>)> =
HashMap::new();
let tenant_id = tenant_path
.file_name()
.and_then(ffi::OsStr::to_str)
.unwrap_or_default()
.parse::<ZTenantId>()
.context("Could not parse tenant id out of the tenant dir name")?;
let timelines_dir = config.timelines_path(&tenant_id);
for timelines_dir_entry in fs::read_dir(&timelines_dir).with_context(|| {
format!(
"Failed to list timelines dir entry for tenant {}",
tenant_id
)
})? {
match timelines_dir_entry {
Ok(timelines_dir_entry) => {
let timeline_path = timelines_dir_entry.path();
match collect_timeline_files(&timeline_path) {
Ok((timeline_id, metadata, timeline_files)) => {
timelines.insert(
ZTenantTimelineId {
tenant_id,
timeline_id,
},
(metadata, timeline_files),
);
}
Err(e) => error!(
"Failed to process timeline dir contents at '{}', reason: {:?}",
timeline_path.display(),
e
),
}
}
Err(e) => error!(
"Failed to list timelines for entry tenant {}, reason: {:?}",
tenant_id, e
),
}
}
Ok(timelines)
}
fn collect_timeline_files(
timeline_dir: &Path,
) -> anyhow::Result<(ZTimelineId, TimelineMetadata, Vec<PathBuf>)> {
let mut timeline_files = Vec::new();
let mut timeline_metadata_path = None;
let timeline_id = timeline_dir
.file_name()
.and_then(ffi::OsStr::to_str)
.unwrap_or_default()
.parse::<ZTimelineId>()
.context("Could not parse timeline id out of the timeline dir name")?;
let timeline_dir_entries =
fs::read_dir(&timeline_dir).context("Failed to list timeline dir contents")?;
for entry in timeline_dir_entries {
let entry_path = entry.context("Failed to list timeline dir entry")?.path();
if entry_path.is_file() {
if entry_path.file_name().and_then(ffi::OsStr::to_str) == Some(METADATA_FILE_NAME) {
timeline_metadata_path = Some(entry_path);
} else {
timeline_files.push(entry_path);
}
}
}
let timeline_metadata_path = match timeline_metadata_path {
Some(path) => path,
None => bail!("No metadata file found in the timeline directory"),
};
let metadata = TimelineMetadata::from_bytes(
&fs::read(&timeline_metadata_path).context("Failed to read timeline metadata file")?,
)
.context("Failed to parse timeline metadata file bytes")?;
Ok((timeline_id, metadata, timeline_files))
}
/// Storage (potentially remote) API to manage its state.
/// This storage tries to be unaware of any layered repository context,
/// providing basic CRUD operations for storage files.
#[async_trait::async_trait]
trait RemoteStorage: Send + Sync {
/// A way to uniquely reference a file in the remote storage.
type StoragePath;
/// Attempts to derive the storage path out of the local path, if the latter is correct.
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath>;
/// Gets the download path of the given storage file.
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf>;
/// Lists all items the storage has right now.
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>>;
/// Streams the local file contents into remote into the remote storage entry.
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()>;
/// Streams the remote storage entry contents into the buffered writer given, returns the filled writer.
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
/// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer.
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()>;
}
fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> {
if prefix == path {
anyhow::bail!(
"Prefix and the path are equal, cannot strip: '{}'",
prefix.display()
)
} else {
path.strip_prefix(prefix).with_context(|| {
format!(
"Path '{}' is not prefixed with '{}'",
path.display(),
prefix.display(),
)
})
}
}

View File

@@ -0,0 +1,72 @@
# Non-implementation details
This document describes the current state of the backup system in pageserver, existing limitations and concerns, why some things are done the way they are the future development plans.
Detailed description on how the synchronization works and how it fits into the rest of the pageserver can be found in the [storage module](./../remote_storage.rs) and its submodules.
Ideally, this document should disappear after current implementation concerns are mitigated, with the remaining useful knowledge bits moved into rustdocs.
## Approach
Backup functionality is a new component, appeared way after the core DB functionality was implemented.
Pageserver layer functionality is also quite volatile at the moment, there's a risk its local file management changes over time.
To avoid adding more chaos into that, backup functionality is currently designed as a relatively standalone component, with the majority of its logic placed in a standalone async loop.
This way, the backups are managed in background, not affecting directly other pageserver parts: this way the backup and restoration process may lag behind, but eventually keep up with the reality. To track that, a set of prometheus metrics is exposed from pageserver.
## What's done
Current implementation
* provides remote storage wrappers for AWS S3 and local FS
* synchronizes the differences with local timelines and remote states as fast as possible
* uploads new relishes, frozen by pageserver checkpoint thread
* downloads and registers timelines, found on the remote storage, but missing locally, if those are requested somehow via pageserver (e.g. http api, gc)
* uses compression when deals with files, for better S3 usage
* maintains an index of what's stored remotely
* evicts failing tasks and stops the corresponding timelines
The tasks are delayed with every retry and the retries are capped, to avoid poisonous tasks.
After any task eviction, or any error at startup checks (e.g. obviously different and wrong local and remote states fot the same timeline),
the timeline has to be stopped from submitting further checkpoint upload tasks, which is done along the corresponding timeline status change.
No good optimisations or performance testing is done, the feature is disabled by default and gets polished over time.
It's planned to deal with all questions that are currently on and prepare the feature to be enabled by default in cloud environments.
### Peculiarities
As mentioned, the backup component is rather new and under development currently, so not all things are done properly from the start.
Here's the list of known compromises with comments:
* Remote storage file model is currently a custom archive format, that's not possible to deserialize without a particular Rust code of ours (including `serde`).
We also don't optimize the archivation and pack every timeline checkpoint separately, so the resulting blob's size that gets on S3 could be arbitrary.
But, it's a single blob, which is way better than storing ~780 small files separately.
* Archive index restoration requires reading every blob's head.
This could be avoided by a background thread/future storing the serialized index in the remote storage.
* no proper file comparison
No file checksum assertion is done currently, but should be (AWS S3 returns file checksums during the `list` operation)
* sad rust-s3 api
rust-s3 is not very pleasant to use:
1. it returns `anyhow::Result` and it's hard to distinguish "missing file" cases from "no connection" one, for instance
2. at least one function it its API that we need (`get_object_stream`) has `async` keyword and blocks (!), see details [here](https://github.com/zenithdb/zenith/pull/752#discussion_r728373091)
3. it's a prerelease library with unclear maintenance status
4. noisy on debug level
But it's already used in the project, so for now it's reused to avoid bloating the dependency tree.
Based on previous evaluation, even `rusoto-s3` could be a better choice over this library, but needs further benchmarking.
* gc is ignored
So far, we don't adjust the remote storage based on GC thread loop results, only checkpointer loop affects the remote storage.
Index module could be used as a base to implement a deferred GC mechanism, a "defragmentation" that repacks archives into new ones after GC is done removing the files from the archives.
* bracnhes implementaion could be improved
Currently, there's a code to sync the branches along with the timeline files: on upload, every local branch files that are missing remotely are uploaded,
on the timeline download, missing remote branch files are downlaoded.
A branch is a per-tenant entity, yet a current implementaion requires synchronizing a timeline first to get the branch files locally.
Currently, there's no other way to know about the remote branch files, neither the file contents is verified and updated.

View File

@@ -0,0 +1,689 @@
//! Local filesystem acting as a remote storage.
//! Multiple pageservers can use the same "storage" of this kind by using different storage roots.
//!
//! This storage used in pageserver tests, but can also be used in cases when a certain persistent
//! volume is mounted to the local FS.
use std::{
future::Future,
path::{Path, PathBuf},
pin::Pin,
};
use anyhow::{bail, ensure, Context};
use tokio::{
fs,
io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt},
};
use tracing::*;
use super::{strip_path_prefix, RemoteStorage};
pub struct LocalFs {
pageserver_workdir: &'static Path,
root: PathBuf,
}
impl LocalFs {
/// Attempts to create local FS storage, along with its root directory.
pub fn new(root: PathBuf, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
if !root.exists() {
std::fs::create_dir_all(&root).with_context(|| {
format!(
"Failed to create all directories in the given root path '{}'",
root.display(),
)
})?;
}
Ok(Self {
pageserver_workdir,
root,
})
}
fn resolve_in_storage(&self, path: &Path) -> anyhow::Result<PathBuf> {
if path.is_relative() {
Ok(self.root.join(path))
} else if path.starts_with(&self.root) {
Ok(path.to_path_buf())
} else {
bail!(
"Path '{}' does not belong to the current storage",
path.display()
)
}
}
}
#[async_trait::async_trait]
impl RemoteStorage for LocalFs {
type StoragePath = PathBuf;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
Ok(self.root.join(
strip_path_prefix(self.pageserver_workdir, local_path)
.context("local path does not belong to this storage")?,
))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
let relative_path = strip_path_prefix(&self.root, storage_path)
.context("local path does not belong to this storage")?;
Ok(self.pageserver_workdir.join(relative_path))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
get_all_files(&self.root).await
}
async fn upload(
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()> {
let target_file_path = self.resolve_in_storage(to)?;
create_target_directory(&target_file_path).await?;
let mut destination = io::BufWriter::new(
fs::OpenOptions::new()
.write(true)
.create(true)
.open(&target_file_path)
.await
.with_context(|| {
format!(
"Failed to open target fs destination at '{}'",
target_file_path.display()
)
})?,
);
io::copy(&mut from, &mut destination)
.await
.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
target_file_path.display()
)
})?;
destination.flush().await.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
target_file_path.display()
)
})?;
Ok(())
}
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
let mut source = io::BufReader::new(
fs::OpenOptions::new()
.read(true)
.open(&file_path)
.await
.with_context(|| {
format!(
"Failed to open source file '{}' to use in the download",
file_path.display()
)
})?,
);
io::copy(&mut source, to).await.with_context(|| {
format!(
"Failed to download file '{}' from the local storage",
file_path.display()
)
})?;
source.flush().await?;
Ok(())
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
if let Some(end_exclusive) = end_exclusive {
ensure!(
end_exclusive > start_inclusive,
"Invalid range, start ({}) is bigger then end ({:?})",
start_inclusive,
end_exclusive
);
if start_inclusive == end_exclusive.saturating_sub(1) {
return Ok(());
}
}
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
let mut source = io::BufReader::new(
fs::OpenOptions::new()
.read(true)
.open(&file_path)
.await
.with_context(|| {
format!(
"Failed to open source file '{}' to use in the download",
file_path.display()
)
})?,
);
source
.seek(io::SeekFrom::Start(start_inclusive))
.await
.context("Failed to seek to the range start in a local storage file")?;
match end_exclusive {
Some(end_exclusive) => {
io::copy(&mut source.take(end_exclusive - start_inclusive), to).await
}
None => io::copy(&mut source, to).await,
}
.with_context(|| {
format!(
"Failed to download file '{}' range from the local storage",
file_path.display()
)
})?;
Ok(())
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(path)?;
if file_path.exists() && file_path.is_file() {
Ok(fs::remove_file(file_path).await?)
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
}
fn get_all_files<'a, P>(
directory_path: P,
) -> Pin<Box<dyn Future<Output = anyhow::Result<Vec<PathBuf>>> + Send + Sync + 'a>>
where
P: AsRef<Path> + Send + Sync + 'a,
{
Box::pin(async move {
let directory_path = directory_path.as_ref();
if directory_path.exists() {
if directory_path.is_dir() {
let mut paths = Vec::new();
let mut dir_contents = fs::read_dir(directory_path).await?;
while let Some(dir_entry) = dir_contents.next_entry().await? {
let file_type = dir_entry.file_type().await?;
let entry_path = dir_entry.path();
if file_type.is_symlink() {
debug!("{:?} us a symlink, skipping", entry_path)
} else if file_type.is_dir() {
paths.extend(get_all_files(entry_path).await?.into_iter())
} else {
paths.push(dir_entry.path());
}
}
Ok(paths)
} else {
bail!("Path '{}' is not a directory", directory_path.display())
}
} else {
Ok(Vec::new())
}
})
}
async fn create_target_directory(target_file_path: &Path) -> anyhow::Result<()> {
let target_dir = match target_file_path.parent() {
Some(parent_dir) => parent_dir,
None => bail!(
"File path '{}' has no parent directory",
target_file_path.display()
),
};
if !target_dir.exists() {
fs::create_dir_all(target_dir).await?;
}
Ok(())
}
#[cfg(test)]
mod pure_tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
};
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("file_name");
let expected_path = storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?);
assert_eq!(
expected_path,
storage.storage_path(&local_path).expect("Matching path should map to storage path normally"),
"File paths from pageserver workdir should be stored in local fs storage with the same path they have relative to the workdir"
);
Ok(())
}
#[test]
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &LocalFs, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
Ok(wrong_path) => panic!(
"Expected path '{}' to error, but got storage path: {:?}",
mismatching_path.display(),
wrong_path,
),
Err(e) => format!("{:?}", e),
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let error_string = storage_path_error(&storage, &repo_harness.conf.workdir);
assert!(error_string.contains("does not belong to this storage"));
assert!(error_string.contains(repo_harness.conf.workdir.to_str().unwrap()));
let mismatching_path_str = "/something/else";
let error_message = storage_path_error(&storage, Path::new(mismatching_path_str));
assert!(
error_message.contains(mismatching_path_str),
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(error_message.contains("does not belong to this storage"));
Ok(())
}
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
};
let name = "not a metadata";
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join(name);
assert_eq!(
local_path,
storage
.local_path(
&storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?)
)
.expect("For a valid input, valid local path should be parsed"),
"Should be able to parse metadata out of the correctly named remote delta file"
);
let local_metadata_path = repo_harness
.timeline_path(&TIMELINE_ID)
.join(METADATA_FILE_NAME);
let remote_metadata_path = storage.storage_path(&local_metadata_path)?;
assert_eq!(
local_metadata_path,
storage
.local_path(&remote_metadata_path)
.expect("For a valid input, valid local path should be parsed"),
"Should be able to parse metadata out of the correctly named remote metadata file"
);
Ok(())
}
#[test]
fn local_path_negatives() -> anyhow::Result<()> {
#[track_caller]
#[allow(clippy::ptr_arg)] // have to use &PathBuf due to `storage.local_path` parameter requirements
fn local_path_error(storage: &LocalFs, storage_path: &PathBuf) -> String {
match storage.local_path(storage_path) {
Ok(wrong_path) => panic!(
"Expected local path input {:?} to cause an error, but got file path: {:?}",
storage_path, wrong_path,
),
Err(e) => format!("{:?}", e),
}
}
let repo_harness = RepoHarness::create("local_path_negatives")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let totally_wrong_path = "wrong_wrong_wrong";
let error_message = local_path_error(&storage, &PathBuf::from(totally_wrong_path));
assert!(error_message.contains(totally_wrong_path));
Ok(())
}
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let storage_root = PathBuf::from("somewhere").join("else");
let dummy_storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let storage_path = dummy_storage.storage_path(&original_path)?;
let download_destination = dummy_storage.local_path(&storage_path)?;
assert_eq!(
original_path, download_destination,
"'original path -> storage path -> matching fs path' transformation should produce the same path as the input one for the correct path"
);
Ok(())
}
}
#[cfg(test)]
mod fs_tests {
use super::*;
use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID};
use std::io::Write;
use tempfile::tempdir;
#[tokio::test]
async fn upload_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("upload_file")?;
let storage = create_storage()?;
let source = create_file_for_upload(
&storage.pageserver_workdir.join("whatever"),
"whatever_contents",
)
.await?;
let target_path = PathBuf::from("/").join("somewhere").join("else");
match storage.upload(source, &target_path).await {
Ok(()) => panic!("Should not allow storing files with wrong target path"),
Err(e) => {
let message = format!("{:?}", e);
assert!(message.contains(&target_path.display().to_string()));
assert!(message.contains("does not belong to the current storage"));
}
}
assert!(storage.list().await?.is_empty());
let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1").await?;
assert_eq!(
storage.list().await?,
vec![target_path_1.clone()],
"Should list a single file after first upload"
);
let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2").await?;
assert_eq!(
list_files_sorted(&storage).await?,
vec![target_path_1.clone(), target_path_2.clone()],
"Should list a two different files after second upload"
);
Ok(())
}
fn create_storage() -> anyhow::Result<LocalFs> {
let pageserver_workdir = Box::leak(Box::new(tempdir()?.path().to_owned()));
let storage = LocalFs::new(tempdir()?.path().to_owned(), pageserver_workdir)?;
Ok(storage)
}
#[tokio::test]
async fn download_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage.download(&upload_target, &mut content_bytes).await?;
content_bytes.flush().await?;
let contents = String::from_utf8(content_bytes.into_inner().into_inner())?;
assert_eq!(
dummy_contents(upload_name),
contents,
"We should upload and download the same contents"
);
let non_existing_path = PathBuf::from("somewhere").join("else");
match storage.download(&non_existing_path, &mut io::sink()).await {
Ok(_) => panic!("Should not allow downloading non-existing storage files"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&non_existing_path.display().to_string()));
}
}
Ok(())
}
#[tokio::test]
async fn download_file_range_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_positive")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(&upload_target, 0, None, &mut full_range_bytes)
.await?;
full_range_bytes.flush().await?;
assert_eq!(
dummy_contents(upload_name),
String::from_utf8(full_range_bytes.into_inner().into_inner())?,
"Download full range should return the whole upload"
);
let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let same_byte = 1_000_000_000;
storage
.download_range(
&upload_target,
same_byte,
Some(same_byte + 1), // exclusive end
&mut zero_range_bytes,
)
.await?;
zero_range_bytes.flush().await?;
assert!(
zero_range_bytes.into_inner().into_inner().is_empty(),
"Zero byte range should not download any part of the file"
);
let uploaded_bytes = dummy_contents(upload_name).into_bytes();
let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
&mut first_part_remote,
)
.await?;
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
assert_eq!(
first_part_local,
first_part_remote.as_slice(),
"First part bytes should be returned when requested"
);
let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(
&upload_target,
first_part_local.len() as u64,
Some((first_part_local.len() + second_part_local.len()) as u64),
&mut second_part_remote,
)
.await?;
second_part_remote.flush().await?;
let second_part_remote = second_part_remote.into_inner().into_inner();
assert_eq!(
second_part_local,
second_part_remote.as_slice(),
"Second part bytes should be returned when requested"
);
Ok(())
}
#[tokio::test]
async fn download_file_range_negative() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_negative")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let start = 10000;
let end = 234;
assert!(start > end, "Should test an incorrect range");
match storage
.download_range(&upload_target, start, Some(end), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading wrong ranges"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("Invalid range"));
assert!(error_string.contains(&start.to_string()));
assert!(error_string.contains(&end.to_string()));
}
}
let non_existing_path = PathBuf::from("somewhere").join("else");
match storage
.download_range(&non_existing_path, 1, Some(3), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading non-existing storage file ranges"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&non_existing_path.display().to_string()));
}
}
Ok(())
}
#[tokio::test]
async fn delete_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("delete_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
storage.delete(&upload_target).await?;
assert!(storage.list().await?.is_empty());
match storage.delete(&upload_target).await {
Ok(()) => panic!("Should not allow deleting non-existing storage files"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&upload_target.display().to_string()));
}
}
Ok(())
}
async fn upload_dummy_file(
harness: &RepoHarness,
storage: &LocalFs,
name: &str,
) -> anyhow::Result<PathBuf> {
let timeline_path = harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?;
let storage_path = storage.root.join(relative_timeline_path).join(name);
storage
.upload(
create_file_for_upload(
&storage.pageserver_workdir.join(name),
&dummy_contents(name),
)
.await?,
&storage_path,
)
.await?;
Ok(storage_path)
}
async fn create_file_for_upload(
path: &Path,
contents: &str,
) -> anyhow::Result<io::BufReader<fs::File>> {
std::fs::create_dir_all(path.parent().unwrap())?;
let mut file_for_writing = std::fs::OpenOptions::new()
.write(true)
.create_new(true)
.open(path)?;
write!(file_for_writing, "{}", contents)?;
drop(file_for_writing);
Ok(io::BufReader::new(
fs::OpenOptions::new().read(true).open(&path).await?,
))
}
fn dummy_contents(name: &str) -> String {
format!("contents for {}", name)
}
async fn list_files_sorted(storage: &LocalFs) -> anyhow::Result<Vec<PathBuf>> {
let mut files = storage.list().await?;
files.sort();
Ok(files)
}
}

View File

@@ -0,0 +1,438 @@
//! AWS S3 storage wrapper around `rust_s3` library.
//!
//! Respects `prefix_in_bucket` property from [`S3Config`],
//! allowing multiple pageservers to independently work with the same S3 bucket, if
//! their bucket prefixes are both specified and different.
use std::path::{Path, PathBuf};
use anyhow::Context;
use s3::{bucket::Bucket, creds::Credentials, region::Region};
use tokio::io::{self, AsyncWriteExt};
use tracing::debug;
use crate::{
config::S3Config,
remote_storage::{strip_path_prefix, RemoteStorage},
};
const S3_FILE_SEPARATOR: char = '/';
#[derive(Debug, Eq, PartialEq)]
pub struct S3ObjectKey(String);
impl S3ObjectKey {
fn key(&self) -> &str {
&self.0
}
fn download_destination(
&self,
pageserver_workdir: &Path,
prefix_to_strip: Option<&str>,
) -> PathBuf {
let path_without_prefix = match prefix_to_strip {
Some(prefix) => self.0.strip_prefix(prefix).unwrap_or_else(|| {
panic!(
"Could not strip prefix '{}' from S3 object key '{}'",
prefix, self.0
)
}),
None => &self.0,
};
pageserver_workdir.join(
path_without_prefix
.split(S3_FILE_SEPARATOR)
.collect::<PathBuf>(),
)
}
}
/// AWS S3 storage.
pub struct S3 {
pageserver_workdir: &'static Path,
bucket: Bucket,
prefix_in_bucket: Option<String>,
}
impl S3 {
/// Creates the storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
debug!(
"Creating s3 remote storage around bucket {}",
aws_config.bucket_name
);
let region = match aws_config.endpoint.clone() {
Some(endpoint) => Region::Custom {
endpoint,
region: aws_config.bucket_region.clone(),
},
None => aws_config
.bucket_region
.parse::<Region>()
.context("Failed to parse the s3 region from config")?,
};
let credentials = Credentials::new(
aws_config.access_key_id.as_deref(),
aws_config.secret_access_key.as_deref(),
None,
None,
None,
)
.context("Failed to create the s3 credentials")?;
let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| {
let mut prefix = prefix;
while prefix.starts_with(S3_FILE_SEPARATOR) {
prefix = &prefix[1..]
}
let mut prefix = prefix.to_string();
while prefix.ends_with(S3_FILE_SEPARATOR) {
prefix.pop();
}
prefix
});
Ok(Self {
bucket: Bucket::new_with_path_style(
aws_config.bucket_name.as_str(),
region,
credentials,
)
.context("Failed to create the s3 bucket")?,
pageserver_workdir,
prefix_in_bucket,
})
}
}
#[async_trait::async_trait]
impl RemoteStorage for S3 {
type StoragePath = S3ObjectKey;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
let relative_path = strip_path_prefix(self.pageserver_workdir, local_path)?;
let mut key = self.prefix_in_bucket.clone().unwrap_or_default();
for segment in relative_path {
key.push(S3_FILE_SEPARATOR);
key.push_str(&segment.to_string_lossy());
}
Ok(S3ObjectKey(key))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
Ok(storage_path
.download_destination(self.pageserver_workdir, self.prefix_in_bucket.as_deref()))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
let list_response = self
.bucket
.list(self.prefix_in_bucket.clone().unwrap_or_default(), None)
.await
.context("Failed to list s3 objects")?;
Ok(list_response
.into_iter()
.flat_map(|response| response.contents)
.map(|s3_object| S3ObjectKey(s3_object.key))
.collect())
}
async fn upload(
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()> {
let mut upload_contents = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(&mut from, &mut upload_contents)
.await
.context("Failed to read the upload contents")?;
upload_contents
.flush()
.await
.context("Failed to read the upload contents")?;
let upload_contents = upload_contents.into_inner().into_inner();
let (_, code) = self
.bucket
.put_object(to.key(), &upload_contents)
.await
.with_context(|| format!("Failed to create s3 object with key {}", to.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during creating object with key '{}', code: {}",
to.key(),
code
))
} else {
Ok(())
}
}
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
let (data, code) = self
.bucket
.get_object(from.key())
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during downloading object, code: {}",
code
))
} else {
// we don't have to write vector into the destination this way, `to_write_all` would be enough.
// but we want to prepare for migration on `rusoto`, that has a streaming HTTP body instead here, with
// which it makes more sense to use `io::copy`.
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded data into the destination buffer")?;
Ok(())
}
}
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
// S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
// and needs both ends to be exclusive
let end_inclusive = end_exclusive.map(|end| end.saturating_sub(1));
let (data, code) = self
.bucket
.get_object_range(from.key(), start_inclusive, end_inclusive)
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 206 {
Err(anyhow::format_err!(
"Received non-206 exit code during downloading object range, code: {}",
code
))
} else {
// see `download` function above for the comment on why `Vec<u8>` buffer is copied this way
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded range into the destination buffer")?;
Ok(())
}
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
let (_, code) = self
.bucket
.delete_object(path.key())
.await
.with_context(|| format!("Failed to delete s3 object with key {}", path.key()))?;
if code != 204 {
Err(anyhow::format_err!(
"Received non-204 exit code during deleting object with key '{}', code: {}",
path.key(),
code
))
} else {
Ok(())
}
}
}
#[cfg(test)]
mod tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[test]
fn download_destination() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination")?;
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("test_name");
let relative_path = local_path.strip_prefix(&repo_harness.conf.workdir)?;
let key = S3ObjectKey(format!(
"{}{}",
S3_FILE_SEPARATOR,
relative_path
.iter()
.map(|segment| segment.to_str().unwrap())
.collect::<Vec<_>>()
.join(&S3_FILE_SEPARATOR.to_string()),
));
assert_eq!(
local_path,
key.download_destination(&repo_harness.conf.workdir, None),
"Download destination should consist of s3 path joined with the pageserver workdir prefix"
);
Ok(())
}
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let segment_1 = "matching";
let segment_2 = "file";
let local_path = &repo_harness.conf.workdir.join(segment_1).join(segment_2);
let storage = dummy_storage(&repo_harness.conf.workdir);
let expected_key = S3ObjectKey(format!(
"{}{SEPARATOR}{}{SEPARATOR}{}",
storage.prefix_in_bucket.as_deref().unwrap_or_default(),
segment_1,
segment_2,
SEPARATOR = S3_FILE_SEPARATOR,
));
let actual_key = storage
.storage_path(local_path)
.expect("Matching path should map to S3 path normally");
assert_eq!(
expected_key,
actual_key,
"S3 key from the matching path should contain all segments after the workspace prefix, separated with S3 separator"
);
Ok(())
}
#[test]
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &S3, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
Ok(wrong_key) => panic!(
"Expected path '{}' to error, but got S3 key: {:?}",
mismatching_path.display(),
wrong_key,
),
Err(e) => e.to_string(),
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let error_message = storage_path_error(&storage, &repo_harness.conf.workdir);
assert!(
error_message.contains("Prefix and the path are equal"),
"Message '{}' does not contain the required string",
error_message
);
let mismatching_path = PathBuf::from("somewhere").join("else");
let error_message = storage_path_error(&storage, &mismatching_path);
assert!(
error_message.contains(mismatching_path.to_str().unwrap()),
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(
error_message.contains("is not prefixed with"),
"Message '{}' does not contain a required string",
error_message
);
Ok(())
}
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_dir.strip_prefix(&repo_harness.conf.workdir)?;
let s3_key = create_s3_key(
&relative_timeline_path.join("not a metadata"),
storage.prefix_in_bucket.as_deref(),
);
assert_eq!(
s3_key.download_destination(
&repo_harness.conf.workdir,
storage.prefix_in_bucket.as_deref()
),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
"Should be able to parse metadata out of the correctly named remote delta file"
);
let s3_key = create_s3_key(
&relative_timeline_path.join(METADATA_FILE_NAME),
storage.prefix_in_bucket.as_deref(),
);
assert_eq!(
s3_key.download_destination(
&repo_harness.conf.workdir,
storage.prefix_in_bucket.as_deref()
),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
"Should be able to parse metadata out of the correctly named remote metadata file"
);
Ok(())
}
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let dummy_storage = dummy_storage(&repo_harness.conf.workdir);
let key = dummy_storage.storage_path(&original_path)?;
let download_destination = dummy_storage.local_path(&key)?;
assert_eq!(
original_path, download_destination,
"'original path -> storage key -> matching fs path' transformation should produce the same path as the input one for the correct path"
);
Ok(())
}
fn dummy_storage(pageserver_workdir: &'static Path) -> S3 {
S3 {
pageserver_workdir,
bucket: Bucket::new(
"dummy-bucket",
"us-east-1".parse().unwrap(),
Credentials::anonymous().unwrap(),
)
.unwrap(),
prefix_in_bucket: Some("dummy_prefix/".to_string()),
}
}
fn create_s3_key(relative_file_path: &Path, prefix: Option<&str>) -> S3ObjectKey {
S3ObjectKey(relative_file_path.iter().fold(
prefix.unwrap_or_default().to_string(),
|mut path_string, segment| {
path_string.push(S3_FILE_SEPARATOR);
path_string.push_str(segment.to_str().unwrap());
path_string
},
))
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,613 @@
//! A set of structs to represent a compressed part of the timeline, and methods to asynchronously compress and uncompress a stream of data,
//! without holding the entire data in memory.
//! For the latter, both compress and uncompress functions operate buffered streams (currently hardcoded size of [`ARCHIVE_STREAM_BUFFER_SIZE_BYTES`]),
//! not attempting to hold the entire archive in memory.
//!
//! The compression is done with <a href="https://datatracker.ietf.org/doc/html/rfc8878">zstd</a> streaming algorithm via the `async-compression` crate.
//! The crate does not contain any knobs to tweak the compression, but otherwise is one of the only ones that's both async and has an API to manage the part of an archive.
//! Zstd was picked as the best algorithm among the ones available in the crate, after testing the initial timeline file compression.
//!
//! Archiving is almost agnostic to timeline file types, with an exception of the metadata file, that's currently distinguished in the [un]compression code.
//! The metadata file is treated separately when [de]compression is involved, to reduce the risk of corrupting the metadata file.
//! When compressed, the metadata file is always required and stored as the last file in the archive stream.
//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other relishes are decompressed successfully first.
//!
//! Archive structure:
//! +----------------------------------------+
//! | header | file_1, ..., file_k, metadata |
//! +----------------------------------------+
//!
//! The archive consists of two separate zstd archives:
//! * header archive, that contains all files names and their sizes and relative paths in the timeline directory
//! Header is a Rust structure, serialized into bytes and compressed with zstd.
//! * files archive, that has metadata file as the last one, all compressed with zstd into a single binary blob
//!
//! Header offset is stored in the file name, along with the `disk_consistent_lsn` from the metadata file.
//! See [`parse_archive_name`] and [`ARCHIVE_EXTENSION`] for the name details, example: `00000000016B9150-.zst_9732`.
//! This way, the header could be retrieved without reading an entire archive file.
use std::{
collections::BTreeSet,
future::Future,
io::Cursor,
path::{Path, PathBuf},
sync::Arc,
};
use anyhow::{bail, ensure, Context};
use async_compression::tokio::bufread::{ZstdDecoder, ZstdEncoder};
use serde::{Deserialize, Serialize};
use tokio::{
fs,
io::{self, AsyncReadExt, AsyncWriteExt},
};
use tracing::*;
use zenith_utils::{bin_ser::BeSer, lsn::Lsn};
use crate::layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME};
use super::index::RelativePath;
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct ArchiveHeader {
/// All regular timeline files, excluding the metadata file.
pub files: Vec<FileEntry>,
// Metadata file name is known to the system, as its location relative to the timeline dir,
// so no need to store anything but its size in bytes.
pub metadata_file_size: u64,
}
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct FileEntry {
/// Uncompressed file size, bytes.
pub size: u64,
/// A path, relative to the directory root, used when compressing the directory contents.
pub subpath: RelativePath,
}
const ARCHIVE_EXTENSION: &str = "-.zst_";
const ARCHIVE_STREAM_BUFFER_SIZE_BYTES: usize = 4 * 1024 * 1024;
/// Streams an archive of files given into a stream target, defined by the closure.
///
/// The closure approach is picked for cases like S3, where we would need a name of the file before we can get a stream to write the bytes into.
/// Current idea is to place the header size in the name of the file, to enable the fast partial remote file index restoration without actually reading remote storage file contents.
///
/// Performs the compression in multiple steps:
/// * prepares an archive header, stripping the `source_dir` prefix from the `files`
/// * generates the name of the archive
/// * prepares archive producer future, knowing the header and the file list
/// An `impl AsyncRead` and `impl AsyncWrite` pair of connected streams is created to implement the partial contents streaming.
/// The writer end gets into the archive producer future, to put the header and a stream of compressed files.
/// * prepares archive consumer future, by executing the provided closure
/// The closure gets the reader end stream and the name of the file to create a future that would stream the file contents elsewhere.
/// * runs and waits for both futures to complete
/// * on a successful completion of both futures, header, its size and the user-defined consumer future return data is returned
/// Due to the design above, the archive name and related data is visible inside the consumer future only, so it's possible to return the data,
/// needed for future processing.
pub async fn archive_files_as_stream<Cons, ConsRet, Fut>(
source_dir: &Path,
files: impl Iterator<Item = &PathBuf>,
metadata: &TimelineMetadata,
create_archive_consumer: Cons,
) -> anyhow::Result<(ArchiveHeader, u64, ConsRet)>
where
Cons: FnOnce(Box<dyn io::AsyncRead + Unpin + Send + Sync + 'static>, String) -> Fut
+ Send
+ 'static,
Fut: Future<Output = anyhow::Result<ConsRet>> + Send + 'static,
ConsRet: Send + Sync + 'static,
{
let metadata_bytes = metadata
.to_bytes()
.context("Failed to create metadata bytes")?;
let (archive_header, compressed_header_bytes) =
prepare_header(source_dir, files, &metadata_bytes)
.await
.context("Failed to prepare file for archivation")?;
let header_size = compressed_header_bytes.len() as u64;
let (write, read) = io::duplex(ARCHIVE_STREAM_BUFFER_SIZE_BYTES);
let archive_filler = write_archive_contents(
source_dir.to_path_buf(),
archive_header.clone(),
metadata_bytes,
write,
);
let archive_name = archive_name(metadata.disk_consistent_lsn(), header_size);
let archive_stream =
Cursor::new(compressed_header_bytes).chain(ZstdEncoder::new(io::BufReader::new(read)));
let (archive_creation_result, archive_upload_result) = tokio::join!(
tokio::spawn(archive_filler),
tokio::spawn(async move {
create_archive_consumer(Box::new(archive_stream), archive_name).await
})
);
archive_creation_result
.context("Failed to spawn archive creation future")?
.context("Failed to create an archive")?;
let upload_return_value = archive_upload_result
.context("Failed to spawn archive upload future")?
.context("Failed to upload the archive")?;
Ok((archive_header, header_size, upload_return_value))
}
/// Similar to [`archive_files_as_stream`], creates a pair of streams to uncompress the 2nd part of the archive,
/// that contains files and is located after the header.
/// S3 allows downloading partial file contents for a given file key (i.e. name), to accommodate this retrieval,
/// a closure is used.
/// Same concepts with two concurrent futures, user-defined closure, future and return value apply here, but the
/// consumer and the receiver ends are swapped, since the uncompression happens.
pub async fn uncompress_file_stream_with_index<Prod, ProdRet, Fut>(
destination_dir: PathBuf,
files_to_skip: Arc<BTreeSet<PathBuf>>,
disk_consistent_lsn: Lsn,
header: ArchiveHeader,
header_size: u64,
create_archive_file_part: Prod,
) -> anyhow::Result<ProdRet>
where
Prod: FnOnce(Box<dyn io::AsyncWrite + Unpin + Send + Sync + 'static>, String) -> Fut
+ Send
+ 'static,
Fut: Future<Output = anyhow::Result<ProdRet>> + Send + 'static,
ProdRet: Send + Sync + 'static,
{
let (write, mut read) = io::duplex(ARCHIVE_STREAM_BUFFER_SIZE_BYTES);
let archive_name = archive_name(disk_consistent_lsn, header_size);
let (archive_download_result, archive_uncompress_result) = tokio::join!(
tokio::spawn(async move { create_archive_file_part(Box::new(write), archive_name).await }),
tokio::spawn(async move {
uncompress_with_header(&files_to_skip, &destination_dir, header, &mut read).await
})
);
let download_value = archive_download_result
.context("Failed to spawn archive download future")?
.context("Failed to download an archive")?;
archive_uncompress_result
.context("Failed to spawn archive uncompress future")?
.context("Failed to uncompress the archive")?;
Ok(download_value)
}
/// Reads archive header from the stream given:
/// * parses the file name to get the header size
/// * reads the exact amount of bytes
/// * uncompresses and deserializes those
pub async fn read_archive_header<A: io::AsyncRead + Send + Sync + Unpin>(
archive_name: &str,
from: &mut A,
) -> anyhow::Result<ArchiveHeader> {
let (_, header_size) = parse_archive_name(Path::new(archive_name))?;
let mut compressed_header_bytes = vec![0; header_size as usize];
from.read_exact(&mut compressed_header_bytes)
.await
.with_context(|| {
format!(
"Failed to read header header from the archive {}",
archive_name
)
})?;
let mut header_bytes = Vec::new();
ZstdDecoder::new(io::BufReader::new(compressed_header_bytes.as_slice()))
.read_to_end(&mut header_bytes)
.await
.context("Failed to decompress a header from the archive")?;
Ok(ArchiveHeader::des(&header_bytes)
.context("Failed to deserialize a header from the archive")?)
}
/// Reads the archive metadata out of the archive name:
/// * `disk_consistent_lsn` of the checkpoint that was archived
/// * size of the archive header
pub fn parse_archive_name(archive_path: &Path) -> anyhow::Result<(Lsn, u64)> {
let archive_name = archive_path
.file_name()
.with_context(|| format!("Archive '{}' has no file name", archive_path.display()))?
.to_string_lossy();
let (lsn_str, header_size_str) =
archive_name
.rsplit_once(ARCHIVE_EXTENSION)
.with_context(|| {
format!(
"Archive '{}' has incorrect extension, expected to contain '{}'",
archive_path.display(),
ARCHIVE_EXTENSION
)
})?;
let disk_consistent_lsn = Lsn::from_hex(lsn_str).with_context(|| {
format!(
"Archive '{}' has an invalid disk consistent lsn in its extension",
archive_path.display(),
)
})?;
let header_size = header_size_str.parse::<u64>().with_context(|| {
format!(
"Archive '{}' has an invalid a header offset number in its extension",
archive_path.display(),
)
})?;
Ok((disk_consistent_lsn, header_size))
}
fn archive_name(disk_consistent_lsn: Lsn, header_size: u64) -> String {
let archive_name = format!(
"{:016X}{ARCHIVE_EXTENSION}{}",
u64::from(disk_consistent_lsn),
header_size,
ARCHIVE_EXTENSION = ARCHIVE_EXTENSION,
);
archive_name
}
pub async fn uncompress_with_header(
files_to_skip: &BTreeSet<PathBuf>,
destination_dir: &Path,
header: ArchiveHeader,
archive_after_header: impl io::AsyncRead + Send + Sync + Unpin,
) -> anyhow::Result<()> {
debug!("Uncompressing archive into {}", destination_dir.display());
let mut archive = ZstdDecoder::new(io::BufReader::new(archive_after_header));
if !destination_dir.exists() {
fs::create_dir_all(&destination_dir)
.await
.with_context(|| {
format!(
"Failed to create target directory at {}",
destination_dir.display()
)
})?;
} else if !destination_dir.is_dir() {
bail!(
"Destination path '{}' is not a valid directory",
destination_dir.display()
);
}
debug!("Will extract {} files from the archive", header.files.len());
for entry in header.files {
uncompress_entry(
&mut archive,
&entry.subpath.as_path(destination_dir),
entry.size,
files_to_skip,
)
.await
.with_context(|| format!("Failed to uncompress archive entry {:?}", entry))?;
}
uncompress_entry(
&mut archive,
&destination_dir.join(METADATA_FILE_NAME),
header.metadata_file_size,
files_to_skip,
)
.await
.context("Failed to uncompress the metadata entry")?;
Ok(())
}
async fn uncompress_entry(
archive: &mut ZstdDecoder<io::BufReader<impl io::AsyncRead + Send + Sync + Unpin>>,
destination_path: &Path,
entry_size: u64,
files_to_skip: &BTreeSet<PathBuf>,
) -> anyhow::Result<()> {
if let Some(parent) = destination_path.parent() {
fs::create_dir_all(parent).await.with_context(|| {
format!(
"Failed to create parent directory for {}",
destination_path.display()
)
})?;
};
if files_to_skip.contains(destination_path) {
debug!("Skipping {}", destination_path.display());
copy_n_bytes(entry_size, archive, &mut io::sink())
.await
.context("Failed to skip bytes in the archive")?;
return Ok(());
}
let mut destination =
io::BufWriter::new(fs::File::create(&destination_path).await.with_context(|| {
format!(
"Failed to open file {} for extraction",
destination_path.display()
)
})?);
copy_n_bytes(entry_size, archive, &mut destination)
.await
.with_context(|| {
format!(
"Failed to write extracted archive contents into file {}",
destination_path.display()
)
})?;
destination
.flush()
.await
.context("Failed to flush the streaming archive bytes")?;
Ok(())
}
async fn write_archive_contents(
source_dir: PathBuf,
header: ArchiveHeader,
metadata_bytes: Vec<u8>,
mut archive_input: io::DuplexStream,
) -> anyhow::Result<()> {
debug!("Starting writing files into archive");
for file_entry in header.files {
let path = file_entry.subpath.as_path(&source_dir);
let mut source_file =
io::BufReader::new(fs::File::open(&path).await.with_context(|| {
format!(
"Failed to open file for archiving to path {}",
path.display()
)
})?);
let bytes_written = io::copy(&mut source_file, &mut archive_input)
.await
.with_context(|| {
format!(
"Failed to open add a file into archive, file path {}",
path.display()
)
})?;
ensure!(
file_entry.size == bytes_written,
"File {} was written to the archive incompletely",
path.display()
);
trace!(
"Added file '{}' ({} bytes) into the archive",
path.display(),
bytes_written
);
}
let metadata_bytes_written = io::copy(&mut metadata_bytes.as_slice(), &mut archive_input)
.await
.context("Failed to add metadata into the archive")?;
ensure!(
header.metadata_file_size == metadata_bytes_written,
"Metadata file was written to the archive incompletely",
);
archive_input
.shutdown()
.await
.context("Failed to finalize the archive")?;
debug!("Successfully streamed all files into the archive");
Ok(())
}
async fn prepare_header(
source_dir: &Path,
files: impl Iterator<Item = &PathBuf>,
metadata_bytes: &[u8],
) -> anyhow::Result<(ArchiveHeader, Vec<u8>)> {
let mut archive_files = Vec::new();
for file_path in files {
let file_metadata = fs::metadata(file_path).await.with_context(|| {
format!(
"Failed to read metadata during archive indexing for {}",
file_path.display()
)
})?;
ensure!(
file_metadata.is_file(),
"Archive indexed path {} is not a file",
file_path.display()
);
if file_path.file_name().and_then(|name| name.to_str()) != Some(METADATA_FILE_NAME) {
let entry = FileEntry {
subpath: RelativePath::new(source_dir, file_path).with_context(|| {
format!(
"File '{}' does not belong to pageserver workspace",
file_path.display()
)
})?,
size: file_metadata.len(),
};
archive_files.push(entry);
}
}
let header = ArchiveHeader {
files: archive_files,
metadata_file_size: metadata_bytes.len() as u64,
};
debug!("Appending a header for {} files", header.files.len());
let header_bytes = header.ser().context("Failed to serialize a header")?;
debug!("Header bytes len {}", header_bytes.len());
let mut compressed_header_bytes = Vec::new();
ZstdEncoder::new(io::BufReader::new(header_bytes.as_slice()))
.read_to_end(&mut compressed_header_bytes)
.await
.context("Failed to compress header bytes")?;
debug!(
"Compressed header bytes len {}",
compressed_header_bytes.len()
);
Ok((header, compressed_header_bytes))
}
async fn copy_n_bytes(
n: u64,
from: &mut (impl io::AsyncRead + Send + Sync + Unpin),
into: &mut (impl io::AsyncWrite + Send + Sync + Unpin),
) -> anyhow::Result<()> {
let bytes_written = io::copy(&mut from.take(n), into).await?;
ensure!(
bytes_written == n,
"Failed to read exactly {} bytes from the input, bytes written: {}",
n,
bytes_written,
);
Ok(())
}
#[cfg(test)]
mod tests {
use tokio::{fs, io::AsyncSeekExt};
use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID};
use super::*;
#[tokio::test]
async fn compress_and_uncompress() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("compress_and_uncompress")?;
let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID);
init_directory(
&timeline_dir,
vec![
("first", "first_contents"),
("second", "second_contents"),
(METADATA_FILE_NAME, "wrong_metadata"),
],
)
.await?;
let timeline_files = list_file_paths_with_contents(&timeline_dir).await?;
assert_eq!(
timeline_files,
vec![
(
timeline_dir.join("first"),
FileContents::Text("first_contents".to_string())
),
(
timeline_dir.join(METADATA_FILE_NAME),
FileContents::Text("wrong_metadata".to_string())
),
(
timeline_dir.join("second"),
FileContents::Text("second_contents".to_string())
),
],
"Initial timeline contents should contain two normal files and a wrong metadata file"
);
let metadata = TimelineMetadata::new(Lsn(0x30), None, None, Lsn(0), Lsn(0), Lsn(0));
let paths_to_archive = timeline_files
.into_iter()
.map(|(path, _)| path)
.collect::<Vec<_>>();
let tempdir = tempfile::tempdir()?;
let base_path = tempdir.path().to_path_buf();
let (header, header_size, archive_target) = archive_files_as_stream(
&timeline_dir,
paths_to_archive.iter(),
&metadata,
move |mut archive_streamer, archive_name| async move {
let archive_target = base_path.join(&archive_name);
let mut archive_file = fs::File::create(&archive_target).await?;
io::copy(&mut archive_streamer, &mut archive_file).await?;
Ok(archive_target)
},
)
.await?;
let mut file = fs::File::open(&archive_target).await?;
file.seek(io::SeekFrom::Start(header_size)).await?;
let target_dir = tempdir.path().join("extracted");
uncompress_with_header(&BTreeSet::new(), &target_dir, header, file).await?;
let extracted_files = list_file_paths_with_contents(&target_dir).await?;
assert_eq!(
extracted_files,
vec![
(
target_dir.join("first"),
FileContents::Text("first_contents".to_string())
),
(
target_dir.join(METADATA_FILE_NAME),
FileContents::Binary(metadata.to_bytes()?)
),
(
target_dir.join("second"),
FileContents::Text("second_contents".to_string())
),
],
"Extracted files should contain all local timeline files besides its metadata, which should be taken from the arguments"
);
Ok(())
}
async fn init_directory(
root: &Path,
files_with_contents: Vec<(&str, &str)>,
) -> anyhow::Result<()> {
fs::create_dir_all(root).await?;
for (file_name, contents) in files_with_contents {
fs::File::create(root.join(file_name))
.await?
.write_all(contents.as_bytes())
.await?;
}
Ok(())
}
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum FileContents {
Text(String),
Binary(Vec<u8>),
}
impl std::fmt::Debug for FileContents {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Text(text) => f.debug_tuple("Text").field(text).finish(),
Self::Binary(bytes) => f
.debug_tuple("Binary")
.field(&format!("{} bytes", bytes.len()))
.finish(),
}
}
}
async fn list_file_paths_with_contents(
root: &Path,
) -> anyhow::Result<Vec<(PathBuf, FileContents)>> {
let mut file_paths = Vec::new();
let mut dir_listings = vec![fs::read_dir(root).await?];
while let Some(mut dir_listing) = dir_listings.pop() {
while let Some(entry) = dir_listing.next_entry().await? {
let entry_path = entry.path();
if entry_path.is_file() {
let contents = match String::from_utf8(fs::read(&entry_path).await?) {
Ok(text) => FileContents::Text(text),
Err(e) => FileContents::Binary(e.into_bytes()),
};
file_paths.push((entry_path, contents));
} else if entry_path.is_dir() {
dir_listings.push(fs::read_dir(entry_path).await?);
} else {
info!(
"Skipping path '{}' as it's not a file or a directory",
entry_path.display()
);
}
}
}
file_paths.sort();
Ok(file_paths)
}
}

View File

@@ -0,0 +1,428 @@
//! Timeline synchrnonization logic to put files from archives on remote storage into pageserver's local directory.
//! Currently, tenant branch files are also downloaded, but this does not appear final.
use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc};
use anyhow::{ensure, Context};
use futures::{stream::FuturesUnordered, StreamExt};
use tokio::{fs, sync::RwLock};
use tracing::{debug, error, trace, warn};
use zenith_utils::{lsn::Lsn, zid::ZTenantId};
use crate::{
config::PageServerConf,
layered_repository::metadata::{metadata_path, TimelineMetadata},
remote_storage::{
storage_sync::{
compression, index::TimelineIndexEntry, sync_queue, tenant_branch_files,
update_index_description, SyncKind, SyncTask,
},
RemoteStorage, ZTenantTimelineId,
},
};
use super::{
index::{ArchiveId, RemoteTimeline, RemoteTimelineIndex},
TimelineDownload,
};
/// Timeline download result, with extra data, needed for downloading.
pub(super) enum DownloadedTimeline {
/// Remote timeline data is either absent or corrupt, no download possible.
Abort,
/// Remote timeline data is found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known.
/// Initial download failed due to some error, the download task is rescheduled for another retry.
FailedAndRescheduled { disk_consistent_lsn: Lsn },
/// Remote timeline data is found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known.
/// Initial download successful.
Successful { disk_consistent_lsn: Lsn },
}
/// Attempts to download and uncompress files from all remote archives for the timeline given.
/// Timeline files that already exist locally are skipped during the download, but the local metadata file is
/// updated in the end of every checkpoint archive extraction.
///
/// Before any archives are considered, the branch files are checked locally and remotely, all remote-only files are downloaded.
///
/// On an error, bumps the retries count and reschedules the download, with updated archive skip list
/// (for any new successful archive downloads and extractions).
pub(super) async fn download_timeline<
P: std::fmt::Debug + Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
conf: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
sync_id: ZTenantTimelineId,
mut download: TimelineDownload,
retries: u32,
) -> DownloadedTimeline {
debug!("Downloading layers for sync id {}", sync_id);
let ZTenantTimelineId {
tenant_id,
timeline_id,
} = sync_id;
let index_read = remote_assets.1.read().await;
let remote_timeline = match index_read.timeline_entry(&sync_id) {
None => {
error!("Cannot download: no timeline is present in the index for given ids");
return DownloadedTimeline::Abort;
}
Some(index_entry) => match index_entry {
TimelineIndexEntry::Full(remote_timeline) => Cow::Borrowed(remote_timeline),
TimelineIndexEntry::Description(_) => {
let remote_disk_consistent_lsn = index_entry.disk_consistent_lsn();
drop(index_read);
debug!("Found timeline description for the given ids, downloading the full index");
match update_index_description(
remote_assets.as_ref(),
&conf.timeline_path(&timeline_id, &tenant_id),
sync_id,
)
.await
{
Ok(remote_timeline) => Cow::Owned(remote_timeline),
Err(e) => {
error!("Failed to download full timeline index: {:?}", e);
return match remote_disk_consistent_lsn {
Some(disk_consistent_lsn) => {
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Download(download),
));
DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
}
}
None => {
error!("Cannot download: no disk consistent Lsn is present for the index entry");
DownloadedTimeline::Abort
}
};
}
}
}
},
};
let disk_consistent_lsn = match remote_timeline.checkpoints().max() {
Some(lsn) => lsn,
None => {
debug!("Cannot download: no disk consistent Lsn is present for the remote timeline");
return DownloadedTimeline::Abort;
}
};
if let Err(e) = download_missing_branches(conf, remote_assets.as_ref(), sync_id.tenant_id).await
{
error!(
"Failed to download missing branches for sync id {}: {:?}",
sync_id, e
);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Download(download),
));
return DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
};
}
debug!("Downloading timeline archives");
let archives_to_download = remote_timeline
.checkpoints()
.map(ArchiveId)
.filter(|remote_archive| !download.archives_to_skip.contains(remote_archive))
.collect::<Vec<_>>();
let archives_total = archives_to_download.len();
debug!("Downloading {} archives of a timeline", archives_total);
trace!("Archives to download: {:?}", archives_to_download);
for (archives_downloaded, archive_id) in archives_to_download.into_iter().enumerate() {
match try_download_archive(
conf,
sync_id,
Arc::clone(&remote_assets),
remote_timeline.as_ref(),
archive_id,
Arc::clone(&download.files_to_skip),
)
.await
{
Err(e) => {
let archives_left = archives_total - archives_downloaded;
error!(
"Failed to download archive {:?} (archives downloaded: {}; archives left: {}) for tenant {} timeline {}, requeueing the download: {:?}",
archive_id, archives_downloaded, archives_left, tenant_id, timeline_id, e
);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Download(download),
));
return DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
};
}
Ok(()) => {
debug!("Successfully downloaded archive {:?}", archive_id);
download.archives_to_skip.insert(archive_id);
}
}
}
debug!("Finished downloading all timeline's archives");
DownloadedTimeline::Successful {
disk_consistent_lsn,
}
}
async fn try_download_archive<
P: Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
conf: &'static PageServerConf,
ZTenantTimelineId {
tenant_id,
timeline_id,
}: ZTenantTimelineId,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_timeline: &RemoteTimeline,
archive_id: ArchiveId,
files_to_skip: Arc<BTreeSet<PathBuf>>,
) -> anyhow::Result<()> {
debug!("Downloading archive {:?}", archive_id);
let archive_to_download = remote_timeline
.archive_data(archive_id)
.with_context(|| format!("Archive {:?} not found in remote storage", archive_id))?;
let (archive_header, header_size) = remote_timeline
.restore_header(archive_id)
.context("Failed to restore header when downloading an archive")?;
match read_local_metadata(conf, timeline_id, tenant_id).await {
Ok(local_metadata) => ensure!(
// need to allow `<=` instead of `<` due to cases when a failed archive can be redownloaded
local_metadata.disk_consistent_lsn() <= archive_to_download.disk_consistent_lsn(),
"Cannot download archive with Lsn {} since it's earlier than local Lsn {}",
archive_to_download.disk_consistent_lsn(),
local_metadata.disk_consistent_lsn()
),
Err(e) => warn!("Failed to read local metadata file, assuming it's safe to override its with the download. Read: {:#}", e),
}
compression::uncompress_file_stream_with_index(
conf.timeline_path(&timeline_id, &tenant_id),
files_to_skip,
archive_to_download.disk_consistent_lsn(),
archive_header,
header_size,
move |mut archive_target, archive_name| async move {
let archive_local_path = conf
.timeline_path(&timeline_id, &tenant_id)
.join(&archive_name);
let remote_storage = &remote_assets.0;
remote_storage
.download_range(
&remote_storage.storage_path(&archive_local_path)?,
header_size,
None,
&mut archive_target,
)
.await
},
)
.await?;
Ok(())
}
async fn read_local_metadata(
conf: &'static PageServerConf,
timeline_id: zenith_utils::zid::ZTimelineId,
tenant_id: ZTenantId,
) -> anyhow::Result<TimelineMetadata> {
let local_metadata_path = metadata_path(conf, timeline_id, tenant_id);
let local_metadata_bytes = fs::read(&local_metadata_path)
.await
.context("Failed to read local metadata file bytes")?;
Ok(TimelineMetadata::from_bytes(&local_metadata_bytes)
.context("Failed to read local metadata files bytes")?)
}
async fn download_missing_branches<
P: std::fmt::Debug + Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
conf: &'static PageServerConf,
(storage, index): &(S, RwLock<RemoteTimelineIndex>),
tenant_id: ZTenantId,
) -> anyhow::Result<()> {
let local_branches = tenant_branch_files(conf, tenant_id)
.await
.context("Failed to list local branch files for the tenant")?;
let local_branches_dir = conf.branches_path(&tenant_id);
if !local_branches_dir.exists() {
fs::create_dir_all(&local_branches_dir)
.await
.with_context(|| {
format!(
"Failed to create local branches directory at path '{}'",
local_branches_dir.display()
)
})?;
}
if let Some(remote_branches) = index.read().await.branch_files(tenant_id) {
let mut remote_only_branches_downloads = remote_branches
.difference(&local_branches)
.map(|remote_only_branch| async move {
let branches_dir = conf.branches_path(&tenant_id);
let remote_branch_path = remote_only_branch.as_path(&branches_dir);
let storage_path =
storage.storage_path(&remote_branch_path).with_context(|| {
format!(
"Failed to derive a storage path for branch with local path '{}'",
remote_branch_path.display()
)
})?;
let mut target_file = fs::OpenOptions::new()
.write(true)
.create_new(true)
.open(&remote_branch_path)
.await
.with_context(|| {
format!(
"Failed to create local branch file at '{}'",
remote_branch_path.display()
)
})?;
storage
.download(&storage_path, &mut target_file)
.await
.with_context(|| {
format!(
"Failed to download branch file from the remote path {:?}",
storage_path
)
})?;
Ok::<_, anyhow::Error>(())
})
.collect::<FuturesUnordered<_>>();
let mut branch_downloads_failed = false;
while let Some(download_result) = remote_only_branches_downloads.next().await {
if let Err(e) = download_result {
branch_downloads_failed = true;
error!("Failed to download a branch file: {:?}", e);
}
}
ensure!(
!branch_downloads_failed,
"Failed to download all branch files"
);
}
Ok(())
}
#[cfg(test)]
mod tests {
use std::collections::BTreeSet;
use tempfile::tempdir;
use tokio::fs;
use zenith_utils::lsn::Lsn;
use crate::{
remote_storage::{
local_fs::LocalFs,
storage_sync::test_utils::{
assert_index_descriptions, assert_timeline_files_match, create_local_timeline,
dummy_metadata, ensure_correct_timeline_upload, expect_timeline,
},
},
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[tokio::test]
async fn test_download_timeline() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("test_download_timeline")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
let remote_assets = Arc::new((storage, index));
let storage = &remote_assets.0;
let index = &remote_assets.1;
let regular_timeline_path = repo_harness.timeline_path(&TIMELINE_ID);
let regular_timeline = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["a", "b"],
dummy_metadata(Lsn(0x30)),
)?;
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
regular_timeline,
)
.await;
// upload multiple checkpoints for the same timeline
let regular_timeline = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["c", "d"],
dummy_metadata(Lsn(0x40)),
)?;
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
regular_timeline,
)
.await;
fs::remove_dir_all(&regular_timeline_path).await?;
let remote_regular_timeline = expect_timeline(index, sync_id).await;
download_timeline(
repo_harness.conf,
Arc::clone(&remote_assets),
sync_id,
TimelineDownload {
files_to_skip: Arc::new(BTreeSet::new()),
archives_to_skip: BTreeSet::new(),
},
0,
)
.await;
assert_index_descriptions(
index,
RemoteTimelineIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
remote_assets
.0
.list()
.await
.unwrap()
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
),
)
.await;
assert_timeline_files_match(&repo_harness, TIMELINE_ID, remote_regular_timeline);
Ok(())
}
}

View File

@@ -0,0 +1,427 @@
//! In-memory index to track the tenant files on the remote strorage, mitigating the storage format differences between the local and remote files.
//! Able to restore itself from the storage archive data and reconstruct archive indices on demand.
//!
//! The index is intended to be portable, so deliberately does not store any local paths inside.
//! This way in the future, the index could be restored fast from its serialized stored form.
use std::{
collections::{BTreeMap, BTreeSet, HashMap, HashSet},
path::{Path, PathBuf},
};
use anyhow::{bail, ensure, Context};
use serde::{Deserialize, Serialize};
use tracing::debug;
use zenith_utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
use crate::{
config::PageServerConf,
layered_repository::TIMELINES_SEGMENT_NAME,
remote_storage::{
storage_sync::compression::{parse_archive_name, FileEntry},
ZTenantTimelineId,
},
};
use super::compression::ArchiveHeader;
/// A part of the filesystem path, that needs a root to become a path again.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Serialize, Deserialize)]
pub struct RelativePath(String);
impl RelativePath {
/// Attempts to strip off the base from path, producing a relative path or an error.
pub fn new<P: AsRef<Path>>(base: &Path, path: P) -> anyhow::Result<Self> {
let relative = path
.as_ref()
.strip_prefix(base)
.context("path is not relative to base")?;
Ok(RelativePath(relative.to_string_lossy().to_string()))
}
/// Joins the relative path with the base path.
pub fn as_path(&self, base: &Path) -> PathBuf {
base.join(&self.0)
}
}
/// An index to track tenant files that exist on the remote storage.
/// Currently, timeline archives and branch files are tracked.
#[derive(Debug, Clone)]
pub struct RemoteTimelineIndex {
branch_files: HashMap<ZTenantId, HashSet<RelativePath>>,
timeline_files: HashMap<ZTenantTimelineId, TimelineIndexEntry>,
}
impl RemoteTimelineIndex {
/// Attempts to parse file paths (not checking the file contents) and find files
/// that can be tracked wiht the index.
/// On parse falures, logs the error and continues, so empty index can be created from not suitable paths.
pub fn try_parse_descriptions_from_paths<P: AsRef<Path>>(
conf: &'static PageServerConf,
paths: impl Iterator<Item = P>,
) -> Self {
let mut index = Self {
branch_files: HashMap::new(),
timeline_files: HashMap::new(),
};
for path in paths {
if let Err(e) = try_parse_index_entry(&mut index, conf, path.as_ref()) {
debug!(
"Failed to parse path '{}' as index entry: {:#}",
path.as_ref().display(),
e
);
}
}
index
}
pub fn timeline_entry(&self, id: &ZTenantTimelineId) -> Option<&TimelineIndexEntry> {
self.timeline_files.get(id)
}
pub fn timeline_entry_mut(
&mut self,
id: &ZTenantTimelineId,
) -> Option<&mut TimelineIndexEntry> {
self.timeline_files.get_mut(id)
}
pub fn add_timeline_entry(&mut self, id: ZTenantTimelineId, entry: TimelineIndexEntry) {
self.timeline_files.insert(id, entry);
}
pub fn all_sync_ids(&self) -> impl Iterator<Item = ZTenantTimelineId> + '_ {
self.timeline_files.keys().copied()
}
pub fn add_branch_file(&mut self, tenant_id: ZTenantId, path: RelativePath) {
self.branch_files
.entry(tenant_id)
.or_insert_with(HashSet::new)
.insert(path);
}
pub fn branch_files(&self, tenant_id: ZTenantId) -> Option<&HashSet<RelativePath>> {
self.branch_files.get(&tenant_id)
}
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TimelineIndexEntry {
/// An archive found on the remote storage, but not yet downloaded, only a metadata from its storage path is available, without archive contents.
Description(BTreeMap<ArchiveId, ArchiveDescription>),
/// Full archive metadata, including the file list, parsed from the archive header.
Full(RemoteTimeline),
}
impl TimelineIndexEntry {
pub fn uploaded_checkpoints(&self) -> BTreeSet<Lsn> {
match self {
Self::Description(description) => {
description.keys().map(|archive_id| archive_id.0).collect()
}
Self::Full(remote_timeline) => remote_timeline
.checkpoint_archives
.keys()
.map(|archive_id| archive_id.0)
.collect(),
}
}
/// Gets latest uploaded checkpoint's disk consisten Lsn for the corresponding timeline.
pub fn disk_consistent_lsn(&self) -> Option<Lsn> {
match self {
Self::Description(description) => {
description.keys().map(|archive_id| archive_id.0).max()
}
Self::Full(remote_timeline) => remote_timeline
.checkpoint_archives
.keys()
.map(|archive_id| archive_id.0)
.max(),
}
}
}
/// Checkpoint archive's id, corresponding to the `disk_consistent_lsn` from the timeline's metadata file during checkpointing.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
pub struct ArchiveId(pub(super) Lsn);
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct FileId(ArchiveId, ArchiveEntryNumber);
type ArchiveEntryNumber = usize;
/// All archives and files in them, representing a certain timeline.
/// Uses file and archive IDs to reference those without ownership issues.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct RemoteTimeline {
timeline_files: BTreeMap<FileId, FileEntry>,
checkpoint_archives: BTreeMap<ArchiveId, CheckpointArchive>,
}
/// Archive metadata, enough to restore a header with the timeline data.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct CheckpointArchive {
disk_consistent_lsn: Lsn,
metadata_file_size: u64,
files: BTreeSet<FileId>,
archive_header_size: u64,
}
impl CheckpointArchive {
pub fn disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
}
}
impl RemoteTimeline {
pub fn empty() -> Self {
Self {
timeline_files: BTreeMap::new(),
checkpoint_archives: BTreeMap::new(),
}
}
pub fn checkpoints(&self) -> impl Iterator<Item = Lsn> + '_ {
self.checkpoint_archives
.values()
.map(CheckpointArchive::disk_consistent_lsn)
}
/// Lists all relish files in the given remote timeline. Omits the metadata file.
pub fn stored_files(&self, timeline_dir: &Path) -> BTreeSet<PathBuf> {
self.timeline_files
.values()
.map(|file_entry| file_entry.subpath.as_path(timeline_dir))
.collect()
}
pub fn contains_checkpoint_at(&self, disk_consistent_lsn: Lsn) -> bool {
self.checkpoint_archives
.contains_key(&ArchiveId(disk_consistent_lsn))
}
pub fn archive_data(&self, archive_id: ArchiveId) -> Option<&CheckpointArchive> {
self.checkpoint_archives.get(&archive_id)
}
/// Restores a header of a certain remote archive from the memory data.
/// Returns the header and its compressed size in the archive, both can be used to uncompress that archive.
pub fn restore_header(&self, archive_id: ArchiveId) -> anyhow::Result<(ArchiveHeader, u64)> {
let archive = self
.checkpoint_archives
.get(&archive_id)
.with_context(|| format!("Archive {:?} not found", archive_id))?;
let mut header_files = Vec::with_capacity(archive.files.len());
for (expected_archive_position, archive_file) in archive.files.iter().enumerate() {
let &FileId(archive_id, archive_position) = archive_file;
ensure!(
expected_archive_position == archive_position,
"Archive header is corrupt, file # {} from archive {:?} header is missing",
expected_archive_position,
archive_id,
);
let timeline_file = self.timeline_files.get(archive_file).with_context(|| {
format!(
"File with id {:?} not found for archive {:?}",
archive_file, archive_id
)
})?;
header_files.push(timeline_file.clone());
}
Ok((
ArchiveHeader {
files: header_files,
metadata_file_size: archive.metadata_file_size,
},
archive.archive_header_size,
))
}
/// Updates (creates, if necessary) the data about certain archive contents.
pub fn update_archive_contents(
&mut self,
disk_consistent_lsn: Lsn,
header: ArchiveHeader,
header_size: u64,
) {
let archive_id = ArchiveId(disk_consistent_lsn);
let mut common_archive_files = BTreeSet::new();
for (file_index, file_entry) in header.files.into_iter().enumerate() {
let file_id = FileId(archive_id, file_index);
self.timeline_files.insert(file_id, file_entry);
common_archive_files.insert(file_id);
}
let metadata_file_size = header.metadata_file_size;
self.checkpoint_archives
.entry(archive_id)
.or_insert_with(|| CheckpointArchive {
metadata_file_size,
files: BTreeSet::new(),
archive_header_size: header_size,
disk_consistent_lsn,
})
.files
.extend(common_archive_files.into_iter());
}
}
/// Metadata abput timeline checkpoint archive, parsed from its remote storage path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArchiveDescription {
pub header_size: u64,
pub disk_consistent_lsn: Lsn,
pub archive_name: String,
}
fn try_parse_index_entry(
index: &mut RemoteTimelineIndex,
conf: &'static PageServerConf,
path: &Path,
) -> anyhow::Result<()> {
let tenants_dir = conf.tenants_path();
let tenant_id = path
.strip_prefix(&tenants_dir)
.with_context(|| {
format!(
"Path '{}' does not belong to tenants directory '{}'",
path.display(),
tenants_dir.display(),
)
})?
.iter()
.next()
.with_context(|| format!("Found no tenant id in path '{}'", path.display()))?
.to_string_lossy()
.parse::<ZTenantId>()
.with_context(|| format!("Failed to parse tenant id from path '{}'", path.display()))?;
let branches_path = conf.branches_path(&tenant_id);
let timelines_path = conf.timelines_path(&tenant_id);
match (
RelativePath::new(&branches_path, &path),
path.strip_prefix(&timelines_path),
) {
(Ok(_), Ok(_)) => bail!(
"Path '{}' cannot start with both branches '{}' and the timelines '{}' prefixes",
path.display(),
branches_path.display(),
timelines_path.display()
),
(Ok(branches_entry), Err(_)) => index.add_branch_file(tenant_id, branches_entry),
(Err(_), Ok(timelines_subpath)) => {
let mut segments = timelines_subpath.iter();
let timeline_id = segments
.next()
.with_context(|| {
format!(
"{} directory of tenant {} (path '{}') is not an index entry",
TIMELINES_SEGMENT_NAME,
tenant_id,
path.display()
)
})?
.to_string_lossy()
.parse::<ZTimelineId>()
.with_context(|| {
format!("Failed to parse timeline id from path '{}'", path.display())
})?;
let (disk_consistent_lsn, header_size) =
parse_archive_name(path).with_context(|| {
format!(
"Failed to parse archive name out in path '{}'",
path.display()
)
})?;
let archive_name = path
.file_name()
.with_context(|| format!("Archive '{}' has no file name", path.display()))?
.to_string_lossy()
.to_string();
let sync_id = ZTenantTimelineId {
tenant_id,
timeline_id,
};
let timeline_index_entry = index
.timeline_files
.entry(sync_id)
.or_insert_with(|| TimelineIndexEntry::Description(BTreeMap::new()));
match timeline_index_entry {
TimelineIndexEntry::Description(descriptions) => {
descriptions.insert(
ArchiveId(disk_consistent_lsn),
ArchiveDescription {
header_size,
disk_consistent_lsn,
archive_name,
},
);
}
TimelineIndexEntry::Full(_) => {
bail!("Cannot add parsed archive description to its full context in index with sync id {}", sync_id)
}
}
}
(Err(branches_error), Err(timelines_strip_error)) => {
bail!(
"Path '{}' is not an index entry: it's neither parsable as a branch entry '{:#}' nor as an archive entry '{}'",
path.display(),
branches_error,
timelines_strip_error,
)
}
}
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn header_restoration_preserves_file_order() {
let header = ArchiveHeader {
files: vec![
FileEntry {
size: 5,
subpath: RelativePath("one".to_string()),
},
FileEntry {
size: 1,
subpath: RelativePath("two".to_string()),
},
FileEntry {
size: 222,
subpath: RelativePath("zero".to_string()),
},
],
metadata_file_size: 5,
};
let lsn = Lsn(1);
let mut remote_timeline = RemoteTimeline::empty();
remote_timeline.update_archive_contents(lsn, header.clone(), 15);
let (restored_header, _) = remote_timeline
.restore_header(ArchiveId(lsn))
.expect("Should be able to restore header from a valid remote timeline");
assert_eq!(
header, restored_header,
"Header restoration should preserve file order"
);
}
}

View File

@@ -0,0 +1,573 @@
//! Timeline synchronization logic to compress and upload to the remote storage all new timeline files from the checkpoints.
//! Currently, tenant branch files are also uploaded, but this does not appear final.
use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc};
use anyhow::{ensure, Context};
use futures::{stream::FuturesUnordered, StreamExt};
use tokio::{fs, sync::RwLock};
use tracing::{debug, error, warn};
use zenith_utils::zid::ZTenantId;
use crate::{
config::PageServerConf,
remote_storage::{
storage_sync::{
compression,
index::{RemoteTimeline, TimelineIndexEntry},
sync_queue, tenant_branch_files, update_index_description, SyncKind, SyncTask,
},
RemoteStorage, ZTenantTimelineId,
},
};
use super::{compression::ArchiveHeader, index::RemoteTimelineIndex, NewCheckpoint};
/// Attempts to compress and upload given checkpoint files.
/// No extra checks for overlapping files is made: download takes care of that, ensuring no non-metadata local timeline files are overwritten.
///
/// Before the checkpoint files are uploaded, branch files are uploaded, if any local ones are missing remotely.
///
/// On an error, bumps the retries count and reschedules the entire task.
/// On success, populates index data with new downloads.
pub(super) async fn upload_timeline_checkpoint<
P: std::fmt::Debug + Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
config: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
sync_id: ZTenantTimelineId,
new_checkpoint: NewCheckpoint,
retries: u32,
) -> Option<bool> {
debug!("Uploading checkpoint for sync id {}", sync_id);
if let Err(e) = upload_missing_branches(config, remote_assets.as_ref(), sync_id.tenant_id).await
{
error!(
"Failed to upload missing branches for sync id {}: {:?}",
sync_id, e
);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Upload(new_checkpoint),
));
return Some(false);
}
let new_upload_lsn = new_checkpoint.metadata.disk_consistent_lsn();
let index = &remote_assets.1;
let ZTenantTimelineId {
tenant_id,
timeline_id,
} = sync_id;
let timeline_dir = config.timeline_path(&timeline_id, &tenant_id);
let index_read = index.read().await;
let remote_timeline = match index_read.timeline_entry(&sync_id) {
None => None,
Some(TimelineIndexEntry::Full(remote_timeline)) => Some(Cow::Borrowed(remote_timeline)),
Some(TimelineIndexEntry::Description(_)) => {
debug!("Found timeline description for the given ids, downloading the full index");
match update_index_description(remote_assets.as_ref(), &timeline_dir, sync_id).await {
Ok(remote_timeline) => Some(Cow::Owned(remote_timeline)),
Err(e) => {
error!("Failed to download full timeline index: {:?}", e);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Upload(new_checkpoint),
));
return Some(false);
}
}
}
};
let already_contains_upload_lsn = remote_timeline
.as_ref()
.map(|remote_timeline| remote_timeline.contains_checkpoint_at(new_upload_lsn))
.unwrap_or(false);
if already_contains_upload_lsn {
warn!(
"Received a checkpoint with Lsn {} that's already been uploaded to remote storage, skipping the upload.",
new_upload_lsn
);
return None;
}
let already_uploaded_files = remote_timeline
.map(|timeline| timeline.stored_files(&timeline_dir))
.unwrap_or_default();
drop(index_read);
match try_upload_checkpoint(
config,
Arc::clone(&remote_assets),
sync_id,
&new_checkpoint,
already_uploaded_files,
)
.await
{
Ok((archive_header, header_size)) => {
let mut index_write = index.write().await;
match index_write.timeline_entry_mut(&sync_id) {
Some(TimelineIndexEntry::Full(remote_timeline)) => {
remote_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
}
None | Some(TimelineIndexEntry::Description(_)) => {
let mut new_timeline = RemoteTimeline::empty();
new_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
index_write.add_timeline_entry(sync_id, TimelineIndexEntry::Full(new_timeline));
}
}
debug!("Checkpoint uploaded successfully");
Some(true)
}
Err(e) => {
error!(
"Failed to upload checkpoint: {:?}, requeueing the upload",
e
);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Upload(new_checkpoint),
));
Some(false)
}
}
}
async fn try_upload_checkpoint<
P: Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
config: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
sync_id: ZTenantTimelineId,
new_checkpoint: &NewCheckpoint,
files_to_skip: BTreeSet<PathBuf>,
) -> anyhow::Result<(ArchiveHeader, u64)> {
let ZTenantTimelineId {
tenant_id,
timeline_id,
} = sync_id;
let timeline_dir = config.timeline_path(&timeline_id, &tenant_id);
let files_to_upload = new_checkpoint
.layers
.iter()
.filter(|&path_to_upload| {
if files_to_skip.contains(path_to_upload) {
error!(
"Skipping file upload '{}', since it was already uploaded",
path_to_upload.display()
);
false
} else {
true
}
})
.collect::<Vec<_>>();
ensure!(!files_to_upload.is_empty(), "No files to upload");
compression::archive_files_as_stream(
&timeline_dir,
files_to_upload.into_iter(),
&new_checkpoint.metadata,
move |archive_streamer, archive_name| async move {
let timeline_dir = config.timeline_path(&timeline_id, &tenant_id);
let remote_storage = &remote_assets.0;
remote_storage
.upload(
archive_streamer,
&remote_storage.storage_path(&timeline_dir.join(&archive_name))?,
)
.await
},
)
.await
.map(|(header, header_size, _)| (header, header_size))
}
async fn upload_missing_branches<
P: std::fmt::Debug + Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
config: &'static PageServerConf,
(storage, index): &(S, RwLock<RemoteTimelineIndex>),
tenant_id: ZTenantId,
) -> anyhow::Result<()> {
let local_branches = tenant_branch_files(config, tenant_id)
.await
.context("Failed to list local branch files for the tenant")?;
let index_read = index.read().await;
let remote_branches = index_read
.branch_files(tenant_id)
.cloned()
.unwrap_or_default();
drop(index_read);
let mut branch_uploads = local_branches
.difference(&remote_branches)
.map(|local_only_branch| async move {
let local_branch_path = local_only_branch.as_path(&config.branches_path(&tenant_id));
let storage_path = storage.storage_path(&local_branch_path).with_context(|| {
format!(
"Failed to derive a storage path for branch with local path '{}'",
local_branch_path.display()
)
})?;
let local_branch_file = fs::OpenOptions::new()
.read(true)
.open(&local_branch_path)
.await
.with_context(|| {
format!(
"Failed to open local branch file {} for reading",
local_branch_path.display()
)
})?;
storage
.upload(local_branch_file, &storage_path)
.await
.with_context(|| {
format!(
"Failed to upload branch file to the remote path {:?}",
storage_path
)
})?;
Ok::<_, anyhow::Error>(local_only_branch)
})
.collect::<FuturesUnordered<_>>();
let mut branch_uploads_failed = false;
while let Some(upload_result) = branch_uploads.next().await {
match upload_result {
Ok(local_only_branch) => index
.write()
.await
.add_branch_file(tenant_id, local_only_branch.clone()),
Err(e) => {
error!("Failed to upload branch file: {:?}", e);
branch_uploads_failed = true;
}
}
}
ensure!(!branch_uploads_failed, "Failed to upload all branch files");
Ok(())
}
#[cfg(test)]
mod tests {
use tempfile::tempdir;
use zenith_utils::lsn::Lsn;
use crate::{
remote_storage::{
local_fs::LocalFs,
storage_sync::{
index::ArchiveId,
test_utils::{
assert_index_descriptions, create_local_timeline, dummy_metadata,
ensure_correct_timeline_upload, expect_timeline,
},
},
},
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[tokio::test]
async fn reupload_timeline() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("reupload_timeline")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
let remote_assets = Arc::new((storage, index));
let index = &remote_assets.1;
let first_upload_metadata = dummy_metadata(Lsn(0x10));
let first_checkpoint = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["a", "b"],
first_upload_metadata.clone(),
)?;
let local_timeline_path = repo_harness.timeline_path(&TIMELINE_ID);
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
first_checkpoint,
)
.await;
let uploaded_timeline = expect_timeline(index, sync_id).await;
let uploaded_archives = uploaded_timeline
.checkpoints()
.map(ArchiveId)
.collect::<Vec<_>>();
assert_eq!(
uploaded_archives.len(),
1,
"Only one archive is expected after a first upload"
);
let first_uploaded_archive = uploaded_archives.first().copied().unwrap();
assert_eq!(
uploaded_timeline.checkpoints().last(),
Some(first_upload_metadata.disk_consistent_lsn()),
"Metadata that was uploaded, should have its Lsn stored"
);
assert_eq!(
uploaded_timeline
.archive_data(uploaded_archives.first().copied().unwrap())
.unwrap()
.disk_consistent_lsn(),
first_upload_metadata.disk_consistent_lsn(),
"Uploaded archive should have corresponding Lsn"
);
assert_eq!(
uploaded_timeline.stored_files(&local_timeline_path),
vec![local_timeline_path.join("a"), local_timeline_path.join("b")]
.into_iter()
.collect(),
"Should have all files from the first checkpoint"
);
let second_upload_metadata = dummy_metadata(Lsn(0x40));
let second_checkpoint = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["b", "c"],
second_upload_metadata.clone(),
)?;
assert!(
first_upload_metadata.disk_consistent_lsn()
< second_upload_metadata.disk_consistent_lsn()
);
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
second_checkpoint,
)
.await;
let updated_timeline = expect_timeline(index, sync_id).await;
let mut updated_archives = updated_timeline
.checkpoints()
.map(ArchiveId)
.collect::<Vec<_>>();
assert_eq!(
updated_archives.len(),
2,
"Two archives are expected after a successful update of the upload"
);
updated_archives.retain(|archive_id| archive_id != &first_uploaded_archive);
assert_eq!(
updated_archives.len(),
1,
"Only one new archive is expected among the uploaded"
);
let second_uploaded_archive = updated_archives.last().copied().unwrap();
assert_eq!(
updated_timeline.checkpoints().max(),
Some(second_upload_metadata.disk_consistent_lsn()),
"Metadata that was uploaded, should have its Lsn stored"
);
assert_eq!(
updated_timeline
.archive_data(second_uploaded_archive)
.unwrap()
.disk_consistent_lsn(),
second_upload_metadata.disk_consistent_lsn(),
"Uploaded archive should have corresponding Lsn"
);
assert_eq!(
updated_timeline.stored_files(&local_timeline_path),
vec![
local_timeline_path.join("a"),
local_timeline_path.join("b"),
local_timeline_path.join("c"),
]
.into_iter()
.collect(),
"Should have all files from both checkpoints without duplicates"
);
let third_upload_metadata = dummy_metadata(Lsn(0x20));
let third_checkpoint = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["d"],
third_upload_metadata.clone(),
)?;
assert_ne!(
third_upload_metadata.disk_consistent_lsn(),
first_upload_metadata.disk_consistent_lsn()
);
assert!(
third_upload_metadata.disk_consistent_lsn()
< second_upload_metadata.disk_consistent_lsn()
);
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
third_checkpoint,
)
.await;
let updated_timeline = expect_timeline(index, sync_id).await;
let mut updated_archives = updated_timeline
.checkpoints()
.map(ArchiveId)
.collect::<Vec<_>>();
assert_eq!(
updated_archives.len(),
3,
"Three archives are expected after two successful updates of the upload"
);
updated_archives.retain(|archive_id| {
archive_id != &first_uploaded_archive && archive_id != &second_uploaded_archive
});
assert_eq!(
updated_archives.len(),
1,
"Only one new archive is expected among the uploaded"
);
let third_uploaded_archive = updated_archives.last().copied().unwrap();
assert!(
updated_timeline.checkpoints().max().unwrap()
> third_upload_metadata.disk_consistent_lsn(),
"Should not influence the last lsn by uploading an older checkpoint"
);
assert_eq!(
updated_timeline
.archive_data(third_uploaded_archive)
.unwrap()
.disk_consistent_lsn(),
third_upload_metadata.disk_consistent_lsn(),
"Uploaded archive should have corresponding Lsn"
);
assert_eq!(
updated_timeline.stored_files(&local_timeline_path),
vec![
local_timeline_path.join("a"),
local_timeline_path.join("b"),
local_timeline_path.join("c"),
local_timeline_path.join("d"),
]
.into_iter()
.collect(),
"Should have all files from three checkpoints without duplicates"
);
Ok(())
}
#[tokio::test]
async fn reupload_timeline_rejected() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("reupload_timeline_rejected")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
let remote_assets = Arc::new((storage, index));
let storage = &remote_assets.0;
let index = &remote_assets.1;
let first_upload_metadata = dummy_metadata(Lsn(0x10));
let first_checkpoint = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["a", "b"],
first_upload_metadata.clone(),
)?;
ensure_correct_timeline_upload(
&repo_harness,
Arc::clone(&remote_assets),
TIMELINE_ID,
first_checkpoint,
)
.await;
let after_first_uploads = RemoteTimelineIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
remote_assets
.0
.list()
.await
.unwrap()
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
);
let normal_upload_metadata = dummy_metadata(Lsn(0x20));
assert_ne!(
normal_upload_metadata.disk_consistent_lsn(),
first_upload_metadata.disk_consistent_lsn()
);
let checkpoint_with_no_files = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&[],
normal_upload_metadata.clone(),
)?;
upload_timeline_checkpoint(
repo_harness.conf,
Arc::clone(&remote_assets),
sync_id,
checkpoint_with_no_files,
0,
)
.await;
assert_index_descriptions(index, after_first_uploads.clone()).await;
let checkpoint_with_uploaded_lsn = create_local_timeline(
&repo_harness,
TIMELINE_ID,
&["something", "new"],
first_upload_metadata.clone(),
)?;
upload_timeline_checkpoint(
repo_harness.conf,
Arc::clone(&remote_assets),
sync_id,
checkpoint_with_uploaded_lsn,
0,
)
.await;
assert_index_descriptions(index, after_first_uploads.clone()).await;
Ok(())
}
}

File diff suppressed because it is too large Load Diff

Some files were not shown because too many files have changed in this diff Show More