Compare commits

125 Commits

Author SHA1 Message Date
Bojan Serafimov
a74ca297e1 Write proposer binary 2022-04-20 23:00:43 -04:00
Bojan Serafimov
c8c76a8b99 Add todo 2022-04-19 17:32:14 -04:00
Bojan Serafimov
36673ff709 Write safekeeper fuzz test sketch 2022-04-19 17:18:08 -04:00
bojanserafimov
ef72eb84cf Remove zenfixture (#1534) 2022-04-19 09:46:47 -04:00
Kirill Bulatov
a1e34772e5 Improve compute error logging 2022-04-19 00:20:08 +03:00
Stas Kelvich
389bd1faeb Support for SCRAM-SHA-256 in compute tools 2022-04-18 22:19:01 +03:00
Anastasia Lubennikova
c15aa04714 Move Cluster size limit RFC from rfcs repo 2022-04-18 18:11:31 +03:00
Kirill Bulatov
52e0816fa5 wal_acceptor -> safekeeper 2022-04-18 12:52:31 +03:00
Kirill Bulatov
81417788c8 walkeeper -> safekeeper 2022-04-18 12:52:31 +03:00
Kirill Bulatov
81879f8137 Restore missing cachepot env vars 2022-04-18 12:32:04 +03:00
Arseny Sher
5b29774532 Small refactoring after ec3bc74165.
Move record_safekeeper_info inside safekeeper.rs, fix commit_lsn update, sync
control file.
2022-04-18 13:11:34 +04:00
Kirill Bulatov
0ca2bd929b Remove log crate from pageserver 2022-04-18 00:00:36 +03:00
Kirill Bulatov
9b7dcc2bae Use proper cachepot bucket 2022-04-17 16:35:40 +03:00
Kirill Bulatov
3136a0754a Use mold in Docker images 2022-04-17 00:50:28 +03:00
Kirill Bulatov
787f0d33f0 Use another cachepot bucket for rust Docker build caches 2022-04-16 23:36:42 +03:00
Kirill Bulatov
ed5f9acca9 Revert "Revert libc upgrade" (#1527)
This reverts commit 4bc338babc.
2022-04-16 13:38:48 +03:00
Kirill Bulatov
4bc338babc Revert libc upgrade 2022-04-16 10:03:26 +03:00
Kirill Bulatov
3ab090b43a Fix compute tools build 2022-04-15 23:12:35 +03:00
Kirill Bulatov
7126979950 Remove custom neon Docker build image 2022-04-15 20:08:22 +03:00
Arseny Sher
9946cd1125 Bump vendor/postgres to add safekeeper connection timeout. 2022-04-15 20:44:56 +04:00
Dmitry Ivanov
ab20f2c491 Use the same version of rust-postgres everywhere. (#1516)
Turns out we still had a stale dep in `compute_tools`.
2022-04-15 18:36:11 +03:00
Dmitry Ivanov
c9d897f9b6 [proxy] Update rustls (#1510) 2022-04-15 12:06:25 +03:00
Kirill Bulatov
e97f94cc30 Bump rustc version 2022-04-14 23:01:06 +03:00
Dmitry Rodionov
2cb39a1624 add missing files, update workspace hack 2022-04-14 20:41:21 +03:00
Heikki Linnakangas
93e0ac2b7a Remove a couple of unused dependencies.
Found by "cargo-udeps"
2022-04-14 17:38:26 +03:00
bojanserafimov
d5ae9db997 Add s3 cost estimate to tests (#1478) 2022-04-14 10:09:03 -04:00
Heikki Linnakangas
9e4de6bed0 Use RwLock instead of Mutex for layer map lock.
For more concurrency
2022-04-14 13:34:01 +03:00
Heikki Linnakangas
4a8c663452 Refactor pgbench tests.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
  works, without actually recording any performance numbers, but that
  doesn't seem very interesting anymore. Remove it to avoid confusing it
  with the actual pgbench benchmarks

- Run pgbench with "-n" and "-S" options, for two different workloads:
  simple-updates, and SELECT-only. Previously, we would only run it with
  the "default" TPCB-like workload. That's more or less the same as the
  simple-update (-n) workload, but I think the simple-update workload
  is more relevant for testing storage performance. The SELECT-only
  workload is a new thing to measure.

- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
  a new "remote" implementation of the PgCompare class, which allows
  running the same tests against an already-running Postgres instance.

- Make the PgBenchRunResult.parse_from_output function more
  flexible. pgbench can print different lines depending on the
  command-line options, but the parsing function expected a particular
  set of lines.
2022-04-14 13:31:42 +03:00
Heikki Linnakangas
a009fe912a Refactor connection option handling in python tests
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability to override them in the connect() call.
2022-04-14 13:31:40 +03:00
Heikki Linnakangas
19954dfd8a Refactor proxy options test to not rely on the 'schema' argument.
It was the only test that used the 'schema' argument to the connect()
function. I'm about to refactor the option handling and will remove
the special 'schema' argument altogether, so rewrite the test to not
use it.
2022-04-14 13:31:37 +03:00
Heikki Linnakangas
570db6f168 Update README for Zenith -> Neon renaming.
There's a lot of renaming left to do in the code and docs, but this is
a start. Our binaries and many other things are still called "zenith",
but I didn't change those in the README, because otherwise the
examples won't work. I added a brief note at the top of the README to
explain that we're in the process of renaming, until we've renamed
everything.
2022-04-14 11:30:01 +03:00
Arthur Petukhovsky
cdf04b6a9f Fix control file updates in safekeeper (#1452)
Now control_file::Storage implements Deref for read-only access to the state. All updates should clone the state before modifying and persisting.
2022-04-14 09:31:35 +03:00
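
The access pattern described in cdf04b6a9f (read-only access through Deref, updates by cloning the state, modifying the copy, and persisting it) might look roughly like the sketch below. `State`, `Storage`, `persist`, and `advance_commit_lsn` are hypothetical stand-ins, not the actual `control_file` types, and the disk write is simulated by a counter.

```rust
use std::ops::Deref;

/// Hypothetical stand-in for the safekeeper's persistent state.
#[derive(Clone, Debug, PartialEq)]
pub struct State {
    pub commit_lsn: u64,
}

/// Hypothetical storage wrapper: readers go through `Deref`,
/// writers must go through `persist`.
pub struct Storage {
    state: State,
    /// Counts simulated disk writes (a real implementation would write and fsync a file).
    pub persist_count: usize,
}

impl Storage {
    pub fn new(state: State) -> Self {
        Storage { state, persist_count: 0 }
    }

    /// "Persist" the new state, then make it visible to readers.
    pub fn persist(&mut self, new_state: &State) {
        self.persist_count += 1; // stand-in for the disk write
        self.state = new_state.clone();
    }
}

impl Deref for Storage {
    type Target = State;

    fn deref(&self) -> &State {
        &self.state
    }
}

/// Update pattern: clone the current state, modify the copy, persist it.
pub fn advance_commit_lsn(storage: &mut Storage, lsn: u64) {
    let mut new_state = (**storage).clone();
    new_state.commit_lsn = lsn;
    storage.persist(&new_state);
}
```

Because `Deref` only hands out `&State`, callers can read fields such as `storage.commit_lsn` directly but cannot change the in-memory state without also going through the persisting path.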
Dhammika Pathirana
a0781f229c Add ps compact command
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add ps compact command to api (#707) (#1484)
2022-04-13 22:47:13 -07:00
Dmitry Rodionov
1d36c5a39e reenable s3 on staging pageservers by default
After the deadlock fix in https://github.com/neondatabase/neon/pull/1496 s3
seems to work normally. There is one more discovered issue but it is not
a blocker so can be fixed separately.
2022-04-13 20:10:39 +03:00
Dmitry Rodionov
49da76237b remove noisy debug log message 2022-04-13 19:50:31 +03:00
Dhammika Pathirana
1fd08107ca Add ps compaction_threshold config
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add ps compaction_threshold knob for (#707) (#1484)
2022-04-13 07:42:58 -07:00
Daniil
58d5136a61 compute_tools: check writability handler (#941) 2022-04-13 17:16:25 +03:00
Arthur Petukhovsky
87020f8126 Fix CI staging deploy (#1499)
- Remove stopped safekeeper from inventory
- Fix github pages address after neon rename
2022-04-13 10:59:29 +03:00
Dmitry Rodionov
20414c4b16 defuse possible deadlock in download_timeline too 2022-04-13 10:05:19 +03:00
Dmitry Rodionov
9b7a8e67a4 fix deadlock in upload_timeline_checkpoint
It originated from the fact that we were calling to fetch_full_index
without releasing the read guard, and fetch_full_index tries to acquire
read again. For a plain mutex that is already a deadlock; for an RW lock, the
deadlock was caused by an attempt to acquire write access later in the
code while still holding an active read guard further up the stack

This is sort of a bandaid because Kirill plans to change this code
during removal of an archiving mechanism
2022-04-13 10:05:19 +03:00
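
The shape of the fix in 9b7a8e67a4 (release the read guard before calling anything that locks again) can be sketched with `std::sync::RwLock`. The `Index` type and the `fetch_entry` and `bump_entry` functions are hypothetical and only mirror the locking structure described in the commit message, not the pageserver's actual code.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical index type; not the pageserver's remote timeline index.
pub type Index = RwLock<HashMap<String, u64>>;

/// Helper that takes its own read lock, in the role `fetch_full_index`
/// plays in the commit message.
pub fn fetch_entry(index: &Index, key: &str) -> Option<u64> {
    index.read().unwrap().get(key).copied()
}

/// Bump an entry without ever holding two guards at once.
pub fn bump_entry(index: &Index, key: &str) -> u64 {
    // Scope the first read guard so it is released before any further locking.
    let exists = {
        let guard = index.read().unwrap();
        guard.contains_key(key)
    }; // read guard dropped here

    // No guard is held while the helper acquires its own read lock.
    let current = if exists {
        fetch_entry(index, key).unwrap_or(0)
    } else {
        0
    };

    // No read guard is alive, so the write lock cannot end up waiting on this thread.
    let mut guard = index.write().unwrap();
    guard.insert(key.to_string(), current + 1);
    current + 1
}
```

Dropping the guard between steps means another thread may change the map in between, so this shape trades atomicity for deadlock freedom; whether that is acceptable depends on the caller.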
Dmitry Ivanov
4af87f3d60 [proxy] Add SCRAM auth mechanism implementation (#1050)
* [proxy] Add SCRAM auth

* [proxy] Implement some tests for SCRAM

* Refactoring + test fixes

* Hide SCRAM mechanism behind `#[cfg(test)]`

Currently we only use it in tests, so we hide all relevant
module behind `#[cfg(test)]` to prevent "unused item" warnings.
2022-04-13 03:00:32 +03:00
Alexey Kondratov
0fbe657b2f Fix remote e2e tests after repository rename (#1434)
Also start them after release build instead of debug. It saves 3-5
minutes and we anyway use release mode in Docker images.
2022-04-13 00:02:06 +03:00
Konstantin Knizhnik
07a9553700 Add test for restore from WAL (#1366)
* Add test for restore from WAL

* Fix python formatting

* Choose unused port in wal restore test

* Move recovery tests to zenith_utils/scripts

* Set LD_LIBRARY_PATH in wal recovery scripts

* Fix python test formatting

* Fix mypy warning

* Bump postgres version

* Bump postgres version
2022-04-11 22:30:08 +03:00
Kirill Bulatov
dc7e3ff05a Fix rustc 1.60 clippy warnings 2022-04-11 21:34:04 +03:00
Kirill Bulatov
4f172e7612 Replicate S3 blob metadata in the remote storage 2022-04-11 21:34:04 +03:00
Kirill Bulatov
0e9ee772af Use rusoto in safekeeper 2022-04-11 21:34:04 +03:00
Kirill Bulatov
db63fa64ae Use rusoto lib for S3 relish_storage impl 2022-04-11 21:34:04 +03:00
Arthur Petukhovsky
8e2a6661e9 Make wal_storage initialization eager (#1489) 2022-04-11 20:36:26 +03:00
Heikki Linnakangas
214567bf8f Use B-tree for the index in image and delta layers.
We now use a page cache for those, instead of slurping the whole index into
memory.

Fixes https://github.com/zenithdb/zenith/issues/1356

This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
2022-04-07 20:58:55 +03:00
Heikki Linnakangas
c4b57e4b8f Move BlobRef
It's not needed in image layers anymore, so move it into delta_layer.rs
2022-04-07 20:58:55 +03:00
Heikki Linnakangas
5d9851f5d1 Refactor the I/O functions.
This introduces two new abstraction layers for I/O:

- Block I/O, and
- Blob I/O.

The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.

There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.

The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.

Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.

This is a backwards-incompatible change to the storage format.
2022-04-07 20:58:54 +03:00
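
As an illustration of the blob-oriented API described in 5d9851f5d1, here is a minimal in-memory sketch. The commit message only says that a length field precedes the data; the 4-byte big-endian length, the `BlobStore` type, and the `read_blob` name are assumptions made for this example, and a `Vec<u8>` stands in for the underlying block storage.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical in-memory stand-in for the underlying block storage.
#[derive(Default)]
pub struct BlobStore {
    data: Vec<u8>,
}

impl BlobStore {
    /// Write a length field (assumed here: 4 bytes, big-endian) followed by the
    /// payload, and return the offset at which the blob starts.
    pub fn write_blob(&mut self, blob: &[u8]) -> u64 {
        let offset = self.data.len() as u64;
        let len = u32::try_from(blob.len()).expect("blob too large for a 4-byte length");
        self.data.extend_from_slice(&len.to_be_bytes());
        self.data.extend_from_slice(blob);
        offset
    }

    /// Read back the blob stored at `offset`, or `None` if the offset
    /// does not point at a complete blob.
    pub fn read_blob(&self, offset: u64) -> Option<&[u8]> {
        let start = usize::try_from(offset).ok()?;
        let header = self.data.get(start..start.checked_add(4)?)?;
        let len = u32::from_be_bytes(header.try_into().ok()?) as usize;
        let body_start = start + 4;
        self.data.get(body_start..body_start.checked_add(len)?)
    }
}
```

The offset returned by `write_blob` is the only handle a caller needs to keep, which is what lets an index store small fixed-size references instead of the blobs themselves.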
Arthur Petukhovsky
81ba23094e Fix scripts to deploy sk4 on staging (#1476)
Adjust ansible scripts and inventory for sk4 on staging
2022-04-07 20:38:26 +03:00
bojanserafimov
d5258cdc4d [proxy] Don't print passwords (#1298) 2022-04-06 20:05:24 -04:00
Arthur Petukhovsky
6bc78a0e77 Log more info in test_many_timelines asserts (#1473)
It will help to debug #1470 as soon as it happens again
2022-04-07 01:44:26 +03:00
bojanserafimov
6fe443e239 Improve random_writes test (#1469)
If you want to test with a 3GB database by tweaking some constants you'll hit a query timeout. I fix that by batching the inserts.
2022-04-06 18:32:10 -04:00
Alexey Kondratov
d0c246ac3c Update pageserver OpenAPI spec with missing attach/detach methods (#1463)
We have had these methods in the API for some time, so mentioning them in the
spec could be useful for console (see zenithdb/console#867), as we generate
pageserver HTTP API golang client there.
2022-04-05 20:01:57 +03:00
Heikki Linnakangas
2f784144fe Avoid deadlock when locking two buffers.
It happened in unit tests. If a thread tries to read a buffer while
already holding a lock on one buffer, the code to find a victim buffer
to evict could try to evict the buffer that's already locked. To fix,
skip locked buffers.
2022-04-04 20:12:31 +03:00
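
The "skip locked buffers" rule from 2f784144fe can be sketched with `Mutex::try_lock`, which returns an error instead of blocking when a lock is already held. `Slot` and `find_victim` are hypothetical; the real page cache has its own slot and eviction logic.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical buffer pool slot; the real page cache is more involved.
pub struct Slot {
    pub contents: Mutex<Vec<u8>>,
}

/// Scan the slots starting at `start` and return the first one that can be
/// locked without blocking. Slots whose lock is already held (possibly by the
/// calling thread itself) are skipped rather than waited on.
pub fn find_victim(slots: &[Slot], start: usize) -> Option<(usize, MutexGuard<'_, Vec<u8>>)> {
    let n = slots.len();
    for i in 0..n {
        let idx = (start + i) % n;
        // try_lock never blocks; a held lock yields Err, so we move on.
        if let Ok(guard) = slots[idx].contents.try_lock() {
            return Some((idx, guard));
        }
    }
    None
}
```

If every slot is locked the function returns `None`, and the caller has to decide whether to retry or fail; the sketch does not model that policy.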
Heikki Linnakangas
222b723354 Handle read errors when dumping a delta layer file.
If a file is corrupt, let's not stop on first read error, but continue
dumping.
2022-04-04 20:12:28 +03:00
Heikki Linnakangas
089ba6abfe Clean up some comments that still referred to 'segments' 2022-04-04 20:12:25 +03:00
Arthur Petukhovsky
a5a478c321 Bump vendor/postgres to store WAL on disk only (#1342)
Now WAL is no longer held in compute memory
2022-04-04 16:32:30 +03:00
Konstantin Knizhnik
fcf613b6e3 Fix unit tests build 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
572b3f48cf Add compaction_target_size parameter 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
bef9b837f1 Replace rwlock with mutex in repartition 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
232fe14297 Refactor partitioning 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
92031d376a Fix unit tests 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
1f0b406b63 Perform repartitioning in compaction thread
refer #1441
2022-04-04 10:43:27 +03:00
Kirill Bulatov
4c9447589a Place an info span into gc loop step 2022-04-03 19:30:36 +03:00
Kirill Bulatov
9e5423c867 Assert in a more informative way 2022-04-03 19:30:36 +03:00
Kirill Bulatov
43c16c5145 Don't log ZIds in the timeline load span 2022-04-03 19:30:36 +03:00
bojanserafimov
af712798e7 Fix pageserver readme formatting
I put the diagram in a fixed-width block, since it wasn't rendering correctly on github.
2022-04-02 00:36:54 +03:00
Dmitry Ivanov
f5da652388 [proxy] Enable keepalives for all tcp connections (#1448) 2022-03-31 20:44:57 +03:00
Anastasia Lubennikova
8745b022a9 Extend LayerMap dump() function to print also open_layers and frozen_layers.
Add verbose option to choose whether we need to print all layer's keys or not.
2022-03-31 17:26:24 +03:00
Arthur Petukhovsky
a40b7cd516 Fix timeouts in test_restarts_under_load (#1436)
* Enable backpressure in test_restarts_under_load

* Remove hacks because #644 is fixed now

* Adjust config in test_restarts_under_load
2022-03-31 17:00:09 +03:00
Konstantin Knizhnik
1aa8fe43cf Fix race condition in image layer (#1440)
* Fix race condition in image layer

refer #1439

* Add explicit drop(inner) in layer load method

* Add explicit drop(inner) in layer load method
2022-03-31 15:47:59 +03:00
Dmitry Rodionov
649f324fe3 make logging in basebackup more consistent 2022-03-30 17:58:51 +03:00
Dmitry Rodionov
8609234204 decrease the log level to debug because it is too noisy 2022-03-30 10:13:38 +03:00
Anton Shyrabokau
5c5629910f Add a test case for reading historic page versions (#1314)
* Add a test case for reading historic page versions

 Test read_page_at_lsn returns correct results when compared to page inspect.
 Validate possibility of reading pages from dropped relation.
 Ensure functions read latest version when null lsn supplied.
 Check that functions do not poison buffer cache with stale page versions.
2022-03-29 22:13:06 -07:00
Kirill Bulatov
277e41f4b7 Show s3 spans in logs and improve the log messages 2022-03-29 19:21:31 +03:00
Arthur Petukhovsky
ce0243bc12 Add metric for last_record_lsn (#1430) 2022-03-29 18:54:24 +03:00
Arseny Sher
ec3bc74165 Add safekeeper information exchange through etcd.
Safekeepers now publish to and pull from etcd per-timeline data. Immediate goal is
WAL truncation, for which every safekeeper must know remote_consistent_lsn; the
next would be callmemaybe replacement.

Adds corresponding '--broker' argument to safekeeper and ability to run etcd in
tests.

Adds test checking remote_consistent_lsn is indeed communicated.
2022-03-29 18:16:49 +04:00
Dmitry Rodionov
9594362f74 change python cache version to 2 (fixes python cache in circle CI) 2022-03-29 10:42:30 +03:00
Dmitry Rodionov
eee0f51e0c use cargo-hakari to manage workspace_hack crate
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features being enabled. Problem occurs when you build crates separately
one by one. So this is irrelevant to our CI setup because there we build
all binaries at once, but it may be relevant for local development.

this also changes cargo's resolver version to 2
2022-03-29 10:42:04 +03:00
Arthur Petukhovsky
fd78110c2b Add default statement_timeout for tests (#1423) 2022-03-29 09:57:00 +03:00
Anton Shyrabokau
be6a6958e2 CI: rebuild postgres when Makefile changes (#1429) 2022-03-28 18:19:20 -07:00
Kirill Bulatov
0e44887929 Show more S3 logs and less verbose WAL logs 2022-03-29 00:36:06 +03:00
Dhammika Pathirana
1aa57fc262 Fix tone down compact log chatter
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-03-28 13:24:13 -07:00
Alexey Kondratov
9a4f0930c0 Turn off S3 for pageserver on staging 2022-03-28 14:14:17 -05:00
Alexey Kondratov
d88f8b4a7e Fix storage deploy condition in ansible playbook 2022-03-28 13:30:40 -05:00
Arthur Petukhovsky
8a901de52a Refactor control file update at safekeeper.
Record global_commit_lsn, have common routine for control file update, add
SafekeeperMemstate.
2022-03-28 21:52:12 +04:00
Alexey Kondratov
a883202495 Enable S3 for pageserver on staging
Follow-up for #1417. Previously we had a problem uploading to S3
due to a huge amount of existing not yet uploaded data. Now we have a
fresh pageserver with LSM storage on staging, so we can try enabling it
once again.
2022-03-28 12:04:40 -05:00
Arseny Sher
780b46ad27 Bump vendor/postgres to fix commit_lsn going backwards. 2022-03-28 20:37:33 +04:00
Arseny Sher
75002adc14 Make shared_buffers large in test_pageserver_catchup.
We intentionally write while pageserver is down, so we shouldn't query it.

Noticed by @petuhovskiy at
https://github.com/zenithdb/postgres/pull/141#issuecomment-1080261700
2022-03-28 20:34:06 +04:00
Heikki Linnakangas
07342f7519 Major storage format rewrite.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what are their sizes, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a
simple set of key-value PUT/GET operations.

It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.

Deployment
----------

This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, at the same time
with a new pageserver with the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
2022-03-28 05:41:15 -05:00
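
To make "every GET operation comes with an LSN" from 07342f7519 concrete, here is a toy versioned key-value store. It is only a sketch under simplifying assumptions: `Key` and `Lsn` are plain `u64` aliases (the commit describes the real Key as a struct of several integer fields), `VersionedStore` is a hypothetical name, and branching, layers, and WAL records are not modeled at all.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified key and LSN types.
pub type Key = u64;
pub type Lsn = u64;

/// Toy versioned store: every value is stored under (key, lsn).
#[derive(Default)]
pub struct VersionedStore {
    versions: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl VersionedStore {
    /// Record a new version of `key` as of `lsn`.
    pub fn put(&mut self, key: Key, lsn: Lsn, value: &[u8]) {
        self.versions.insert((key, lsn), value.to_vec());
    }

    /// Return the newest version of `key` written at or before `lsn`.
    pub fn get(&self, key: Key, lsn: Lsn) -> Option<&[u8]> {
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value.as_slice())
    }
}
```

Ordering the map by `(key, lsn)` keeps all versions of one key adjacent, so a GET is a single range lookup that picks the last entry not newer than the requested LSN.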
Kirill Bulatov
55de0b88f5 Hide remote timeline index access details 2022-03-28 12:36:01 +03:00
Kirill Bulatov
d56a0ee19a Avoid recompiling tests for release profile 2022-03-26 08:38:45 +02:00
Kirill Bulatov
18dfc769d8 Use cachepot to cache more rustc builds 2022-03-26 08:38:45 +02:00
Heikki Linnakangas
5e04dad360 Add more variants of the sequential scan performance tests.
More rows, and test with serial and parallel plans. But fewer iterations,
so that the tests run in < 1 minute, and we don't need to mark them as
"slow".
2022-03-25 23:42:13 +02:00
Dmitry Rodionov
b8cba059a5 temporarily disable s3 integration on staging until LSM storage rewrite lands 2022-03-26 00:19:25 +04:00
Heikki Linnakangas
e3fa00972e Use RwLocks in image and delta layers for more concurrency.
With a Mutex, only one thread could read from the layer at a time. I did
some ad hoc profiling with pgbench and saw that a fair amount of time was
spent blocked on these Mutexes.
2022-03-25 15:34:38 +02:00
Kirill Bulatov
b39d1b1717 Exit only on important thread failures 2022-03-25 11:58:54 +02:00
Kirill Bulatov
28bc8e3f5c Log pageserver threads better and shut down on errors in them 2022-03-25 11:58:54 +02:00
Kirill Bulatov
6244fd9e7e Better error messages on zenith cli subcommand invocations 2022-03-25 11:58:54 +02:00
Kirill Bulatov
f6b1d76c30 Replace assert! with ensure! for anyhow::Result functions 2022-03-25 11:58:54 +02:00
Kirill Bulatov
edc7bebcb5 Remove obvious panic sources 2022-03-25 11:58:54 +02:00
Kirill Bulatov
a201d33edc Properly print cachepot stats 2022-03-24 21:11:02 +02:00
Heikki Linnakangas
825d363170 Remove some unnecessary Ord etc. trait implementations.
It doesn't make much sense to compare TimelineMetadata structs with
< or >. But we depended on that in the remote storage upload code,
so replace BTreeSets with Vecs there.
2022-03-24 12:20:06 +02:00
Dmitry Rodionov
b9a1a75b0d clean up unused imports in python tests 2022-03-24 12:47:22 +04:00
Dmitry Rodionov
d3a9cb44a6 tweak timeouts for tenant relocation test 2022-03-24 12:47:22 +04:00
Heikki Linnakangas
c718870517 Tiny refactoring of page_cache::init function.
The init function only needs the 'page_cache_size' from the config, so
seems slightly nicer to pass just that.
2022-03-24 09:46:07 +02:00
Dmitry Rodionov
8437fc056e some follow ups after s3 integration was enabled on staging
* do not error out when upload file list is empty
* ignore ephemeral files during sync initialization
2022-03-23 23:35:36 +04:00
Dmitry Rodionov
8b8d78a3a0 use main branch of our bookfile crate 2022-03-23 22:05:43 +04:00
Dmitry Rodionov
8a86276a6e add more context to error 2022-03-23 18:38:15 +04:00
Dmitry Rodionov
0be7ed0cb5 decrease log message severity for timeline checkpoint internals 2022-03-23 18:20:43 +04:00
Dmitry Rodionov
e80ae4306a change log level from info to debug for timeline gc messages 2022-03-23 18:20:43 +04:00
Heikki Linnakangas
123fcd5d0d Revert accidental bump of vendor/postgres submodule
I accidentally bumped it in commit 3b069f5aef. It didn't seem to cause
any harm, but it was not intentional.
2022-03-23 15:45:29 +02:00
Kirill Bulatov
15434ba7e0 Show cachepot build stats 2022-03-23 14:12:59 +02:00
Andrey Taranik
a4d0d78e9e s3 settings for pageserver (#1388) 2022-03-23 13:39:55 +03:00
Dmitry Rodionov
e13bdd77fe add safekeepers gossip and storage messaging rfcs
they were in prs during rfc repo import

in addition to just import I've added sequence diagrams to storage
messaging rfc
2022-03-22 15:01:26 +04:00
Kirill Bulatov
bd6bef468c Provide single list timelines HTTP API handle 2022-03-21 13:42:21 +02:00
Kirill Bulatov
77ed2a0fa0 Run GitHub testing workflow on every push 2022-03-21 12:46:33 +02:00
Kirill Bulatov
37ebbb598d Add a macOS build 2022-03-21 12:46:33 +02:00
Kirill Bulatov
063f9ba81d Use serde_with to (de)serialize ZId and Lsn to hex 2022-03-21 12:46:07 +02:00
Heikki Linnakangas
3b069f5aef Fix name of directory used in unit test.
There's another test called 'timeline_load'. If the two tests run in
parallel, they would conflict and fail.
2022-03-18 21:27:48 +02:00
Dmitry Rodionov
b19870cd88 guard against partial uploads to local storage 2022-03-18 18:14:57 +03:00
Dmitry Rodionov
7738254f83 refactor timeline memory state management 2022-03-18 18:14:57 +03:00
195 changed files with 16695 additions and 8960 deletions

.circleci/ansible/.gitignore (new file)

@@ -0,0 +1,2 @@
zenith_install.tar.gz
.zenith_current_version


@@ -1,14 +1,11 @@
- name: Upload Zenith binaries
hosts: pageservers:safekeepers
hosts: storage
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: get latest version of Zenith binaries
ignore_errors: true
register: current_version_file
set_fact:
current_version: "{{ lookup('file', '.zenith_current_version') | trim }}"
@@ -16,48 +13,13 @@
- pageserver
- safekeeper
- name: set zero value for current_version
when: current_version_file is failed
set_fact:
current_version: "0"
tags:
- pageserver
- safekeeper
- name: get deployed version from content of remote file
ignore_errors: true
ansible.builtin.slurp:
src: /usr/local/.zenith_current_version
register: remote_version_file
tags:
- pageserver
- safekeeper
- name: decode remote file content
when: remote_version_file is succeeded
set_fact:
remote_version: "{{ remote_version_file['content'] | b64decode | trim }}"
tags:
- pageserver
- safekeeper
- name: set zero value for remote_version
when: remote_version_file is failed
set_fact:
remote_version: "0"
tags:
- pageserver
- safekeeper
- name: inform about versions
debug: msg="Version to deploy - {{ current_version }}, version on storage node - {{ remote_version }}"
debug: msg="Version to deploy - {{ current_version }}"
tags:
- pageserver
- safekeeper
- name: upload and extract Zenith binaries to /usr/local
when: current_version > remote_version or force_deploy
ansible.builtin.unarchive:
owner: root
group: root
@@ -74,14 +36,24 @@
hosts: pageservers
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: upload init script
when: console_mgmt_base_url is defined
ansible.builtin.template:
src: scripts/init_pageserver.sh
dest: /tmp/init_pageserver.sh
owner: root
group: root
mode: '0755'
become: true
tags:
- pageserver
- name: init pageserver
when: current_version > remote_version or force_deploy
shell:
cmd: sudo -u pageserver /usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" --init -D /storage/pageserver/data
cmd: /tmp/init_pageserver.sh
args:
creates: "/storage/pageserver/data/tenants"
environment:
@@ -91,8 +63,20 @@
tags:
- pageserver
- name: update remote storage (s3) config
lineinfile:
path: /storage/pageserver/data/pageserver.toml
line: "{{ item }}"
loop:
- "[remote_storage]"
- "bucket_name = '{{ bucket_name }}'"
- "bucket_region = '{{ bucket_region }}'"
- "prefix_in_bucket = '{{ inventory_hostname }}'"
become: true
tags:
- pageserver
- name: upload systemd service definition
when: current_version > remote_version or force_deploy
ansible.builtin.template:
src: systemd/pageserver.service
dest: /etc/systemd/system/pageserver.service
@@ -104,7 +88,6 @@
- pageserver
- name: start systemd service
when: current_version > remote_version or force_deploy
ansible.builtin.systemd:
daemon_reload: yes
name: pageserver
@@ -115,7 +98,7 @@
- pageserver
- name: post version to console
when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined
when: console_mgmt_base_url is defined
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
@@ -127,22 +110,42 @@
hosts: safekeepers
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: upload init script
when: console_mgmt_base_url is defined
ansible.builtin.template:
src: scripts/init_safekeeper.sh
dest: /tmp/init_safekeeper.sh
owner: root
group: root
mode: '0755'
become: true
tags:
- safekeeper
- name: init safekeeper
shell:
cmd: /tmp/init_safekeeper.sh
args:
creates: "/storage/safekeeper/data/safekeeper.id"
environment:
ZENITH_REPO_DIR: "/storage/safekeeper/data"
LD_LIBRARY_PATH: "/usr/local/lib"
become: true
tags:
- safekeeper
# in the future safekeepers should discover pageservers byself
# but currently use first pageserver that was discovered
- name: set first pageserver var for safekeepers
when: current_version > remote_version or force_deploy
set_fact:
first_pageserver: "{{ hostvars[groups['pageservers'][0]]['inventory_hostname'] }}"
tags:
- safekeeper
- name: upload systemd service definition
when: current_version > remote_version or force_deploy
ansible.builtin.template:
src: systemd/safekeeper.service
dest: /etc/systemd/system/safekeeper.service
@@ -154,7 +157,6 @@
- safekeeper
- name: start systemd service
when: current_version > remote_version or force_deploy
ansible.builtin.systemd:
daemon_reload: yes
name: safekeeper
@@ -165,7 +167,7 @@
- safekeeper
- name: post version to console
when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined
when: console_mgmt_base_url is defined
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)


@@ -1,7 +1,16 @@
 [pageservers]
-zenith-1-ps-1
+zenith-1-ps-1 console_region_id=1
 [safekeepers]
-zenith-1-sk-1
-zenith-1-sk-2
-zenith-1-sk-3
+zenith-1-sk-1 console_region_id=1
+zenith-1-sk-2 console_region_id=1
+zenith-1-sk-3 console_region_id=1
+[storage:children]
+pageservers
+safekeepers
+[storage:vars]
+console_mgmt_base_url = http://console-release.local
+bucket_name = zenith-storage-oregon
+bucket_region = us-west-2


@@ -0,0 +1,30 @@
#!/bin/sh
# get instance id from meta-data service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# store fqdn hostname in var
HOST=$(hostname -f)
cat <<EOF | tee /tmp/payload
{
"version": 1,
"host": "${HOST}",
"port": 6400,
"region_id": {{ console_region_id }},
"instance_id": "${INSTANCE_ID}",
"http_host": "${HOST}",
"http_port": 9898
}
EOF
# check if pageserver already registered or not
if ! curl -sf -X PATCH -d '{}' {{ console_mgmt_base_url }}/api/v1/pageservers/${INSTANCE_ID} -o /dev/null; then
# not registered, so register it now
ID=$(curl -sf -X POST {{ console_mgmt_base_url }}/api/v1/pageservers -d@/tmp/payload | jq -r '.ID')
# init pageserver
sudo -u pageserver /usr/local/bin/pageserver -c "id=${ID}" -c "pg_distrib_dir='/usr/local'" --init -D /storage/pageserver/data
fi


@@ -0,0 +1,30 @@
#!/bin/sh
# get instance id from meta-data service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# store fqdn hostname in var
HOST=$(hostname -f)
cat <<EOF | tee /tmp/payload
{
"version": 1,
"host": "${HOST}",
"port": 6500,
"region_id": {{ console_region_id }},
"instance_id": "${INSTANCE_ID}",
"http_host": "${HOST}",
"http_port": 7676
}
EOF
# check if safekeeper already registered or not
if ! curl -sf -X PATCH -d '{}' {{ console_mgmt_base_url }}/api/v1/safekeepers/${INSTANCE_ID} -o /dev/null; then
# not registered, so register it now
ID=$(curl -sf -X POST {{ console_mgmt_base_url }}/api/v1/safekeepers -d@/tmp/payload | jq -r '.ID')
# init safekeeper
sudo -u safekeeper /usr/local/bin/safekeeper --id ${ID} --init -D /storage/safekeeper/data
fi

View File

@@ -1,7 +1,17 @@
[pageservers]
zenith-us-stage-ps-1
#zenith-us-stage-ps-1 console_region_id=27
zenith-us-stage-ps-2 console_region_id=27
[safekeepers]
zenith-us-stage-sk-1
zenith-us-stage-sk-2
zenith-us-stage-sk-3
zenith-us-stage-sk-1 console_region_id=27
zenith-us-stage-sk-2 console_region_id=27
zenith-us-stage-sk-4 console_region_id=27
[storage:children]
pageservers
safekeepers
[storage:vars]
console_mgmt_base_url = http://console-staging.local
bucket_name = zenith-staging-storage-us-east-1
bucket_region = us-east-1

View File

@@ -5,10 +5,10 @@ executors:
resource_class: xlarge
docker:
# NB: when changed, do not forget to update rust image tag in all Dockerfiles
- image: zimg/rust:1.56
- image: zimg/rust:1.58
zenith-executor:
docker:
- image: zimg/rust:1.56
- image: zimg/rust:1.58
jobs:
check-codestyle-rust:
@@ -34,10 +34,13 @@ jobs:
- checkout
# Grab the postgres git revision to build a cache key.
# Append makefile as it could change the way postgres is built.
# Note this works even though the submodule hasn't been checked out yet.
- run:
name: Get postgres cache key
command: git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
command: |
git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
cat Makefile >> /tmp/cache-key-postgres
- restore_cache:
name: Restore postgres cache
@@ -78,11 +81,14 @@ jobs:
- checkout
# Grab the postgres git revision to build a cache key.
# Append makefile as it could change the way postgres is built.
# Note this works even though the submodule hasn't been checked out yet.
- run:
name: Get postgres cache key
command: |
git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
cat Makefile >> /tmp/cache-key-postgres
- restore_cache:
name: Restore postgres cache
@@ -111,7 +117,12 @@ jobs:
fi
export CARGO_INCREMENTAL=0
export CACHEPOT_BUCKET=zenith-rust-cachepot
export RUSTC_WRAPPER=cachepot
export AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}"
"${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests
cachepot -s
- save_cache:
name: Save rust cache
@@ -141,11 +152,13 @@ jobs:
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
CARGO_FLAGS=--release
fi
"${cov_prefix[@]}" cargo test
"${cov_prefix[@]}" cargo test $CARGO_FLAGS
# Install the rust binaries, for use by test jobs
- run:
@@ -215,12 +228,12 @@ jobs:
- checkout
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- v2-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
@@ -274,12 +287,12 @@ jobs:
- run: git submodule update --init --depth 1
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- v2-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
@@ -392,7 +405,7 @@ jobs:
- run:
name: Build coverage report
command: |
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1
scripts/coverage \
--dir=/tmp/zenith/coverage report \
@@ -403,8 +416,8 @@ jobs:
name: Upload coverage report
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
REPORT_URL=https://zenithdb.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
REPORT_URL=https://neondatabase.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1
scripts/git-upload \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \
@@ -464,7 +477,10 @@ jobs:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/compute-tools:latest -f Dockerfile.compute-tools .
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/compute-tools:latest -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:latest
- run:
name: Init postgres submodule
@@ -518,7 +534,10 @@ jobs:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/compute-tools:release -f Dockerfile.compute-tools .
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/compute-tools:release -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:release
- run:
name: Init postgres submodule
@@ -574,7 +593,7 @@ jobs:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://zenithdb.github.io/helm-charts
helm repo add zenithdb https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
@@ -605,7 +624,7 @@ jobs:
ssh-add ssh-key
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i production.hosts -e console_mgmt_base_url=http://console-release.local
ansible-playbook deploy.yaml -i production.hosts
rm -f zenith_install.tar.gz .zenith_current_version
deploy-release-proxy:
@@ -624,7 +643,7 @@ jobs:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://zenithdb.github.io/helm-charts
helm repo add zenithdb https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
@@ -653,7 +672,7 @@ jobs:
--data \
"{
\"state\": \"pending\",
\"context\": \"zenith-remote-ci\",
\"context\": \"neon-cloud-e2e\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
- run:
@@ -669,7 +688,7 @@ jobs:
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"zenith-remote-ci\",
\"ci_job_name\": \"neon-cloud-e2e\",
\"commit_hash\": \"$CIRCLE_SHA1\",
\"remote_repo\": \"$LOCAL_REPO\"
}
@@ -809,11 +828,11 @@ workflows:
- remote-ci-trigger:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
remote_repo: "zenithdb/console"
remote_repo: "neondatabase/cloud"
requires:
# XXX: Successful build doesn't mean everything is OK, but
# the job to be triggered takes so much time to complete (~22 min)
# that it's better not to wait for the commented-out steps
- build-zenith-debug
- build-zenith-release
# - pg_regress-tests-release
# - other-tests-release

24
.config/hakari.toml Normal file
View File

@@ -0,0 +1,24 @@
# This file contains settings for `cargo hakari`.
# See https://docs.rs/cargo-hakari/latest/cargo_hakari/config for a full list of options.
hakari-package = "workspace_hack"
# Format for `workspace-hack = ...` lines in other Cargo.tomls. Requires cargo-hakari 0.9.8 or above.
dep-format-version = "2"
# Setting workspace.resolver = "2" in the root Cargo.toml is HIGHLY recommended.
# Hakari works much better with the new feature resolver.
# For more about the new feature resolver, see:
# https://blog.rust-lang.org/2021/03/25/Rust-1.51.0.html#cargos-new-feature-resolver
resolver = "2"
# Add triples corresponding to platforms commonly used by developers here.
# https://doc.rust-lang.org/rustc/platform-support.html
platforms = [
# "x86_64-unknown-linux-gnu",
# "x86_64-apple-darwin",
# "x86_64-pc-windows-msvc",
]
# Write out exact versions rather than a semver range. (Defaults to false.)
# exact-versions = true

View File

@@ -26,7 +26,7 @@ jobs:
runs-on: [self-hosted, zenith-benchmarker]
env:
PG_BIN: "/usr/pgsql-13/bin"
POSTGRES_DISTRIB_DIR: "/usr/pgsql-13"
steps:
- name: Checkout zenith repo
@@ -51,7 +51,7 @@ jobs:
echo Poetry
poetry --version
echo Pgbench
$PG_BIN/pgbench --version
$POSTGRES_DISTRIB_DIR/bin/pgbench --version
# FIXME cluster setup is skipped due to various changes in console API
# for now a pre-created cluster is used. When the API gains some stability
@@ -66,7 +66,7 @@ jobs:
echo "Starting cluster"
# wake up the cluster
$PG_BIN/psql $BENCHMARK_CONNSTR -c "SELECT 1"
$POSTGRES_DISTRIB_DIR/bin/psql $BENCHMARK_CONNSTR -c "SELECT 1"
- name: Run benchmark
# pgbench is installed system wide from official repo
@@ -83,8 +83,11 @@ jobs:
# sudo yum install postgresql13-contrib
# actual binaries are located in /usr/pgsql-13/bin/
env:
TEST_PG_BENCH_TRANSACTIONS_MATRIX: "5000,10000,20000"
TEST_PG_BENCH_SCALES_MATRIX: "10,15"
# The pgbench test runs two tests of given duration against each scale.
# So the total runtime with these parameters is 2 * 2 * 300 = 1200, or 20 minutes.
# Plus time needed to initialize the test databases.
TEST_PG_BENCH_DURATIONS_MATRIX: "300"
TEST_PG_BENCH_SCALES_MATRIX: "10,100"
PLATFORM: "zenith-staging"
BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
REMOTE_ENV: "1" # indicate to test harness that we do not have zenith binaries locally
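
The runtime estimate in the comment above (2 * 2 * 300 = 1200 seconds) can be reproduced with a small sketch. It assumes that every duration is run against every scale and that each combination runs two tests, as the comment states. The function names are hypothetical and are not part of the test harness.

```rust
use std::num::ParseIntError;

/// Parse a comma-separated matrix value such as "10,100" into integers.
fn parse_matrix(s: &str) -> Result<Vec<u64>, ParseIntError> {
    s.split(',').map(|item| item.trim().parse::<u64>()).collect()
}

/// Estimate total pgbench runtime in seconds, excluding database
/// initialization. Assumes each duration is run `tests_per_scale` times
/// against each scale (an assumption taken from the workflow comment).
pub fn total_runtime_secs(
    durations: &str,
    scales: &str,
    tests_per_scale: u64,
) -> Result<u64, ParseIntError> {
    let duration_sum: u64 = parse_matrix(durations)?.iter().sum();
    let scale_count = parse_matrix(scales)?.len() as u64;
    Ok(tests_per_scale * scale_count * duration_sum)
}
```

With the values from this workflow (`"300"` and `"10,100"`, two tests per scale) the estimate is 1200 seconds, or 20 minutes.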

View File

@@ -1,10 +1,6 @@
name: Build and Test
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
on: push
jobs:
regression-check:
@@ -13,7 +9,7 @@ jobs:
# If we want to duplicate this job for different
# Rust toolchains (e.g. nightly or 1.37.0), add them here.
rust_toolchain: [stable]
os: [ubuntu-latest]
os: [ubuntu-latest, macos-latest]
timeout-minutes: 30
name: run regression test suite
runs-on: ${{ matrix.os }}
@@ -32,11 +28,17 @@ jobs:
toolchain: ${{ matrix.rust_toolchain }}
override: true
- name: Install postgres dependencies
- name: Install Ubuntu postgres dependencies
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt update
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev
- name: Install macOS postgres dependencies
if: matrix.os == 'macos-latest'
run: |
brew install flex bison
- name: Set pg revision for caching
id: pg_ver
run: echo ::set-output name=pg_rev::$(git rev-parse HEAD:vendor/postgres)

1091
Cargo.lock generated

File diff suppressed because it is too large

View File

@@ -5,19 +5,20 @@ members = [
"pageserver",
"postgres_ffi",
"proxy",
"walkeeper",
"safekeeper",
"workspace_hack",
"zenith",
"zenith_metrics",
"zenith_utils",
]
resolver = "2"
[profile.release]
# This is useful for profiling and, to some extent, debug.
# Besides, debug info should not affect the performance.
debug = true
# This is only needed for proxy's tests
# TODO: we should probably fork tokio-postgres-rustls instead
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }

View File

@@ -1,7 +1,5 @@
# Build Postgres
#
#FROM zimg/rust:1.56 AS pg-build
FROM zenithdb/build:buster-20220309 AS pg-build
FROM zimg/rust:1.58 AS pg-build
WORKDIR /pg
USER root
@@ -11,26 +9,24 @@ COPY Makefile Makefile
ENV BUILD_TYPE release
RUN set -e \
&& make -j $(nproc) -s postgres \
&& mold -run make -j $(nproc) -s postgres \
&& rm -rf tmp_install/build \
&& tar -C tmp_install -czf /postgres_install.tar.gz .
# Build zenith binaries
#
#FROM zimg/rust:1.56 AS build
FROM zenithdb/build:buster-20220309 AS build
FROM zimg/rust:1.58 AS build
ARG GIT_VERSION=local
ARG CACHEPOT_BUCKET=zenith-rust-cachepot
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
#ENV RUSTC_WRAPPER cachepot
ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot
COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY . .
RUN cargo build --release
# Show build caching stats to check if it was used in the end.
# Has to be part of the same RUN, since the cachepot daemon is killed at the end of this RUN, losing the compilation stats.
RUN mold -run cargo build --release && cachepot -s
# Build final image
#

View File

@@ -1,23 +0,0 @@
FROM rust:1.56.1-slim-buster
WORKDIR /home/circleci/project
RUN set -e \
&& apt-get update \
&& apt-get -yq install \
automake \
libtool \
build-essential \
bison \
flex \
libreadline-dev \
zlib1g-dev \
libxml2-dev \
libseccomp-dev \
pkg-config \
libssl-dev \
clang
RUN set -e \
&& rustup component add clippy \
&& cargo install cargo-audit \
&& cargo install --git https://github.com/paritytech/cachepot

View File

@@ -1,14 +1,16 @@
# First transient image to build compute_tools binaries
# NB: keep in sync with rust image version in .circle/config.yml
FROM rust:1.56.1-slim-buster AS rust-build
FROM zimg/rust:1.58 AS rust-build
WORKDIR /zenith
ARG CACHEPOT_BUCKET=zenith-rust-cachepot
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
COPY . .
RUN cargo build -p compute_tools --release
RUN mold -run cargo build -p compute_tools --release && cachepot -s
# Final image that only has one binary
FROM debian:buster-slim
COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl
COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl

View File

@@ -78,6 +78,11 @@ postgres: postgres-configure \
$(MAKE) -C tmp_install/build/contrib/zenith install
+@echo "Compiling contrib/zenith_test_utils"
$(MAKE) -C tmp_install/build/contrib/zenith_test_utils install
+@echo "Compiling pg_buffercache"
$(MAKE) -C tmp_install/build/contrib/pg_buffercache install
+@echo "Compiling pageinspect"
$(MAKE) -C tmp_install/build/contrib/pageinspect install
.PHONY: postgres-clean
postgres-clean:

View File

@@ -1,19 +1,22 @@
# Zenith
# Neon
Zenith is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
Neon is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
The project used to be called "Zenith". Many of the commands and code comments
still refer to "zenith", but we are in the process of renaming things.
## Architecture overview
A Zenith installation consists of compute nodes and Zenith storage engine.
A Neon installation consists of compute nodes and Neon storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Neon storage engine.
Zenith storage engine consists of two major components:
Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
- WAL service. The service that receives WAL from compute node and ensures that it is stored durably.
Pageserver consists of:
- Repository - Zenith storage implementation.
- Repository - Neon storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
@@ -28,17 +31,17 @@ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libsec
libssl-dev clang pkg-config libpq-dev
```
[Rust] 1.56.1 or later is also required.
[Rust] 1.58 or later is also required.
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build zenith and patched postgres
2. Build neon and patched postgres
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
cd zenith
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
@@ -126,7 +129,7 @@ INSERT 0 1
## Running tests
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
git clone --recursive https://github.com/neondatabase/neon.git
make # builds also postgres and installs it to ./tmp_install
./scripts/pytest
```
@@ -141,14 +144,14 @@ To view your `rustdoc` documentation in a browser, try running `cargo doc --no-d
### Postgres-specific terms
Due to Zenith's very close relation with PostgreSQL internals, there are numerous specific terms used.
Due to Neon's very close relation with PostgreSQL internals, there are numerous specific terms used.
Same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use.
To get more familiar with this aspect, refer to:
- [Zenith glossary](/docs/glossary.md)
- [Neon glossary](/docs/glossary.md)
- [PostgreSQL glossary](https://www.postgresql.org/docs/13/glossary.html)
- Other PostgreSQL documentation and sources (Zenith fork sources can be found [here](https://github.com/zenithdb/postgres))
- Other PostgreSQL documentation and sources (Neon fork sources can be found [here](https://github.com/neondatabase/postgres))
## Join the development

View File

@@ -11,9 +11,11 @@ clap = "3.0"
env_logger = "0.9"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1", features = ["macros", "rt", "rt-multi-thread"] }
tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }

View File

@@ -38,6 +38,7 @@ use clap::Arg;
use log::info;
use postgres::{Client, NoTls};
use compute_tools::checker::create_writablity_check_data;
use compute_tools::config;
use compute_tools::http_api::launch_http_server;
use compute_tools::logger::*;
@@ -128,6 +129,7 @@ fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);

View File

@@ -0,0 +1,46 @@
use std::sync::{Arc, RwLock};
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use crate::zenith::ComputeState;
pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
let query = "
CREATE TABLE IF NOT EXISTS health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();";
let result = client.simple_query(query)?;
if result.len() < 2 {
return Err(anyhow::format_err!("executed {} queries", result.len()));
}
Ok(())
}
pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let connstr = state.read().unwrap().connstr.clone();
let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?;
if client.is_closed() {
return Err(anyhow!("connection to postgres closed"));
}
tokio::spawn(async move {
if let Err(e) = connection.await {
error!("connection error: {}", e);
}
});
let result = client
.simple_query("UPDATE health_check SET updated_at = now() WHERE id = 1;")
.await?;
if result.len() != 1 {
return Err(anyhow!("statement can't be executed"));
}
Ok(())
}

View File

@@ -11,7 +11,7 @@ use log::{error, info};
use crate::zenith::*;
// Service function to handle all available routes.
fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
match (req.method(), req.uri().path()) {
// Timestamp of the last Postgres activity in the plain text.
(&Method::GET, "/last_activity") => {
@@ -29,6 +29,15 @@ fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body
Response::new(Body::from(format!("{}", state.ready)))
}
(&Method::GET, "/check_writability") => {
info!("serving /check_writability GET request");
let res = crate::checker::check_writability(&state).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
}
}
// Return the `404 Not Found` for any other routes.
_ => {
let mut not_found = Response::new(Body::from("404 Not Found"));
@@ -48,7 +57,7 @@ async fn serve(state: Arc<RwLock<ComputeState>>) {
async move {
Ok::<_, Infallible>(service_fn(move |req: Request<Body>| {
let state = state.clone();
async move { Ok::<_, Infallible>(routes(req, state)) }
async move { Ok::<_, Infallible>(routes(req, state).await) }
}))
}
});

View File

@@ -2,6 +2,7 @@
//! Various tools and helpers to handle cluster / compute node (Postgres)
//! configuration.
//!
pub mod checker;
pub mod config;
pub mod http_api;
#[macro_use]

View File

@@ -132,7 +132,14 @@ impl Role {
let mut params: String = "LOGIN".to_string();
if let Some(pass) = &self.encrypted_password {
params.push_str(&format!(" PASSWORD 'md5{}'", pass));
// Some time ago we supported only md5 and treated all encrypted_password as md5.
// Now we also support SCRAM-SHA-256 and to preserve compatibility
// we treat all encrypted_password as md5 unless they start with SCRAM-SHA-256.
if pass.starts_with("SCRAM-SHA-256") {
params.push_str(&format!(" PASSWORD '{}'", pass));
} else {
params.push_str(&format!(" PASSWORD 'md5{}'", pass));
}
} else {
params.push_str(" PASSWORD NULL");
}
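
The branching added in this hunk can be read in isolation as a small pure function. The sketch below is a standalone restatement for illustration, not the code in the repository: `password_clause` is a hypothetical name, and the only behavior it relies on is what the hunk shows (a `SCRAM-SHA-256` prefix is passed through unchanged, anything else gets the `md5` prefix, and a missing password becomes `PASSWORD NULL`).

```rust
/// Build the PASSWORD clause for a role.
/// Hypothetical helper mirroring the logic shown in the hunk above.
pub fn password_clause(encrypted_password: Option<&str>) -> String {
    match encrypted_password {
        // SCRAM verifiers already carry their own prefix; pass them through as-is.
        Some(pass) if pass.starts_with("SCRAM-SHA-256") => format!(" PASSWORD '{}'", pass),
        // Anything else is treated as a bare md5 hash, for backward compatibility.
        Some(pass) => format!(" PASSWORD 'md5{}'", pass),
        None => " PASSWORD NULL".to_string(),
    }
}
```

Keeping the decision in one `match` makes the compatibility rule easy to test without a database connection.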

View File

@@ -7,6 +7,7 @@ edition = "2021"
tar = "0.4.33"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
serde = { version = "1.0", features = ["derive"] }
serde_with = "1.12.0"
toml = "0.5"
lazy_static = "1.4"
regex = "1"
@@ -17,6 +18,6 @@ url = "2.2.2"
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
pageserver = { path = "../pageserver" }
walkeeper = { path = "../walkeeper" }
safekeeper = { path = "../safekeeper" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }

View File

@@ -331,14 +331,14 @@ impl PostgresNode {
// Configure the node to connect to the safekeepers
conf.append("synchronous_standby_names", "walproposer");
let wal_acceptors = self
let safekeepers = self
.env
.safekeepers
.iter()
.map(|sk| format!("localhost:{}", sk.pg_port))
.collect::<Vec<String>>()
.join(",");
conf.append("wal_acceptors", &wal_acceptors);
conf.append("wal_acceptors", &safekeepers);
} else {
// We only use setup without safekeepers for tests,
// and don't care about data durability on pageserver,
@@ -420,10 +420,15 @@ impl PostgresNode {
if let Some(token) = auth_token {
cmd.env("ZENITH_AUTH_TOKEN", token);
}
let pg_ctl = cmd.status().context("pg_ctl failed")?;
if !pg_ctl.success() {
anyhow::bail!("pg_ctl failed");
let pg_ctl = cmd.output().context("pg_ctl failed")?;
if !pg_ctl.status.success() {
anyhow::bail!(
"pg_ctl failed, exit code: {}, stdout: {}, stderr: {}",
pg_ctl.status,
String::from_utf8_lossy(&pg_ctl.stdout),
String::from_utf8_lossy(&pg_ctl.stderr),
);
}
Ok(())
}
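
The change above switches from `status()` to `output()` so that a failing `pg_ctl` reports its exit status together with captured stdout and stderr. A minimal standard-library sketch of the same pattern follows. The helper names `describe_failure` and `run_checked` are hypothetical, and the sketch returns a plain `String` error instead of using `anyhow`.

```rust
use std::process::{Command, Output};

/// Format a failure message from an exit status description and raw output bytes.
/// Invalid UTF-8 is replaced rather than rejected, as in the hunk above.
pub fn describe_failure(what: &str, status: &str, stdout: &[u8], stderr: &[u8]) -> String {
    format!(
        "{} failed, {}, stdout: {}, stderr: {}",
        what,
        status,
        String::from_utf8_lossy(stdout).trim_end(),
        String::from_utf8_lossy(stderr).trim_end(),
    )
}

/// Run a command to completion, capturing its output, and turn a
/// non-zero exit into a descriptive error.
pub fn run_checked(cmd: &mut Command, what: &str) -> Result<Output, String> {
    let output = cmd
        .output()
        .map_err(|e| format!("failed to run {}: {}", what, e))?;
    if !output.status.success() {
        return Err(describe_failure(
            what,
            &output.status.to_string(),
            &output.stdout,
            &output.stderr,
        ));
    }
    Ok(output)
}
```

Note that `output()` captures stdout and stderr instead of inheriting them, so nothing is printed to the terminal unless the caller forwards it.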

View File

@@ -5,6 +5,7 @@
use anyhow::{bail, ensure, Context};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::env;
use std::fs;
@@ -12,9 +13,7 @@ use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use zenith_utils::auth::{encode_from_key_file, Claims, Scope};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{
HexZTenantId, HexZTimelineId, ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId,
};
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId};
use crate::safekeeper::SafekeeperNode;
@@ -25,6 +24,7 @@ use crate::safekeeper::SafekeeperNode;
// to 'zenith init --config=<path>' option. See control_plane/simple.conf for
// an example.
//
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct LocalEnv {
// Base directory for all the nodes (the pageserver, safekeepers and
@@ -50,12 +50,17 @@ pub struct LocalEnv {
// Default tenant ID to use with the 'zenith' command line utility, when
// --tenantid is not explicitly specified.
#[serde(default)]
pub default_tenant_id: Option<HexZTenantId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub default_tenant_id: Option<ZTenantId>,
// used to issue tokens during e.g. pg start
#[serde(default)]
pub private_key_path: PathBuf,
// Comma-separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
pub broker_endpoints: Option<String>,
pub pageserver: PageServerConf,
#[serde(default)]
@@ -66,7 +71,8 @@ pub struct LocalEnv {
// A `HashMap<String, HashMap<ZTenantId, ZTimelineId>>` would be more appropriate here,
// but deserialization into a generic toml object as `toml::Value::try_from` fails with an error.
// https://toml.io/en/v1.0.0 does not contain a concept of "a table inside another table".
branch_name_mappings: HashMap<String, Vec<(HexZTenantId, HexZTimelineId)>>,
#[serde_as(as = "HashMap<_, Vec<(DisplayFromStr, DisplayFromStr)>>")]
branch_name_mappings: HashMap<String, Vec<(ZTenantId, ZTimelineId)>>,
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
@@ -164,9 +170,6 @@ impl LocalEnv {
.entry(branch_name.clone())
.or_default();
let tenant_id = HexZTenantId::from(tenant_id);
let timeline_id = HexZTimelineId::from(timeline_id);
let existing_ids = existing_values
.iter()
.find(|(existing_tenant_id, _)| existing_tenant_id == &tenant_id);
@@ -193,7 +196,6 @@ impl LocalEnv {
branch_name: &str,
tenant_id: ZTenantId,
) -> Option<ZTimelineId> {
let tenant_id = HexZTenantId::from(tenant_id);
self.branch_name_mappings
.get(branch_name)?
.iter()
@@ -207,13 +209,7 @@ impl LocalEnv {
.iter()
.flat_map(|(name, tenant_timelines)| {
tenant_timelines.iter().map(|&(tenant_id, timeline_id)| {
(
ZTenantTimelineId::new(
ZTenantId::from(tenant_id),
ZTimelineId::from(timeline_id),
),
name.clone(),
)
(ZTenantTimelineId::new(tenant_id, timeline_id), name.clone())
})
})
.collect()
@@ -259,7 +255,7 @@ impl LocalEnv {
// If no initial tenant ID was given, generate it.
if env.default_tenant_id.is_none() {
env.default_tenant_id = Some(HexZTenantId::from(ZTenantId::generate()));
env.default_tenant_id = Some(ZTenantId::generate());
}
env.base_data_dir = base_path();
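
The hunks above replace the `HexZTenantId` / `HexZTimelineId` wrapper types with `serde_with`'s `DisplayFromStr`, which serializes a value through its `Display` impl and deserializes it through `FromStr`. The sketch below shows only that underlying contract, using the standard library. `TenantId` is a hypothetical stand-in for `ZTenantId`, and the 16-byte hex encoding is an assumption made for illustration.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for an id type; 16 raw bytes rendered as 32 hex digits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TenantId([u8; 16]);

impl fmt::Display for TenantId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Two lowercase hex digits per byte.
        for byte in &self.0 {
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}

impl FromStr for TenantId {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Checking for hex digits up front also guarantees ASCII,
        // so the byte-index slicing below cannot split a character.
        if s.len() != 32 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err(format!("expected 32 hex digits, got {:?}", s));
        }
        let mut out = [0u8; 16];
        for (i, byte) in out.iter_mut().enumerate() {
            *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).map_err(|e| e.to_string())?;
        }
        Ok(TenantId(out))
    }
}
```

Once a type round-trips through `Display` and `FromStr` like this, a separate hex wrapper type and the `From` conversions around it are no longer needed, which is what the removed lines in these hunks reflect.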

View File

@@ -13,8 +13,8 @@ use nix::unistd::Pid;
use postgres::Config;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use safekeeper::http::models::TimelineCreateRequest;
use thiserror::Error;
use walkeeper::http::models::TimelineCreateRequest;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
@@ -73,6 +73,8 @@ pub struct SafekeeperNode {
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
broker_endpoints: Option<String>,
}
impl SafekeeperNode {
@@ -89,6 +91,7 @@ impl SafekeeperNode {
http_client: Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
pageserver,
broker_endpoints: env.broker_endpoints.clone(),
}
}
@@ -135,6 +138,9 @@ impl SafekeeperNode {
if !self.conf.sync {
cmd.arg("--no-sync");
}
if let Some(ref ep) = self.broker_endpoints {
cmd.args(&["--broker-endpoints", ep]);
}
if !cmd.status()?.success() {
bail!(

View File

@@ -1,4 +1,3 @@
use std::convert::TryFrom;
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
@@ -10,7 +9,7 @@ use anyhow::{bail, Context};
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use pageserver::http::models::{TenantCreateRequest, TimelineCreateRequest, TimelineInfoResponse};
use pageserver::http::models::{TenantCreateRequest, TimelineCreateRequest};
use pageserver::timelines::TimelineInfo;
use postgres::{Config, NoTls};
use reqwest::blocking::{Client, RequestBuilder, Response};
@@ -19,7 +18,7 @@ use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{HexZTenantId, HexZTimelineId, ZTenantId, ZTimelineId};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::local_env::LocalEnv;
use crate::{fill_rust_env_vars, read_pidfile};
@@ -149,12 +148,20 @@ impl PageServerNode {
let initial_timeline_id_string = initial_timeline_id.to_string();
args.extend(["--initial-timeline-id", &initial_timeline_id_string]);
let init_output = fill_rust_env_vars(cmd.args(args))
let cmd_with_args = cmd.args(args);
let init_output = fill_rust_env_vars(cmd_with_args)
.output()
.context("pageserver init failed")?;
.with_context(|| {
format!("failed to init pageserver with command {:?}", cmd_with_args)
})?;
if !init_output.status.success() {
bail!("pageserver init failed");
bail!(
"init invocation failed, {}\nStdout: {}\nStderr: {}",
init_output.status,
String::from_utf8_lossy(&init_output.stdout),
String::from_utf8_lossy(&init_output.stderr)
);
}
Ok(initial_timeline_id)
@@ -338,9 +345,7 @@ impl PageServerNode {
) -> anyhow::Result<Option<ZTenantId>> {
let tenant_id_string = self
.http_request(Method::POST, format!("{}/tenant", self.http_base_url))
.json(&TenantCreateRequest {
new_tenant_id: new_tenant_id.map(HexZTenantId::from),
})
.json(&TenantCreateRequest { new_tenant_id })
.send()?
.error_from_body()?
.json::<Option<String>>()?;
@@ -358,7 +363,7 @@ impl PageServerNode {
}
pub fn timeline_list(&self, tenant_id: &ZTenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfoResponse> = self
let timeline_infos: Vec<TimelineInfo> = self
.http_request(
Method::GET,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
@@ -367,10 +372,7 @@ impl PageServerNode {
.error_from_body()?
.json()?;
timeline_infos
.into_iter()
.map(TimelineInfo::try_from)
.collect()
Ok(timeline_infos)
}
pub fn timeline_create(
@@ -386,16 +388,14 @@ impl PageServerNode {
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)
.json(&TimelineCreateRequest {
new_timeline_id: new_timeline_id.map(HexZTimelineId::from),
new_timeline_id,
ancestor_start_lsn,
ancestor_timeline_id: ancestor_timeline_id.map(HexZTimelineId::from),
ancestor_timeline_id,
})
.send()?
.error_from_body()?
.json::<Option<TimelineInfoResponse>>()?;
.json::<Option<TimelineInfo>>()?;
timeline_info_response
.map(TimelineInfo::try_from)
.transpose()
Ok(timeline_info_response)
}
}

View File

@@ -10,5 +10,5 @@
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [walkeeper/README](/walkeeper/README) — WAL service overview.
- [safekeeper/README](/safekeeper/README) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core

View File

@@ -21,7 +21,7 @@ NOTE:It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create branch at certain LSN using `zenith branch` command.
We can create a branch at a certain LSN using the `zenith timeline branch` command.
Each Branch lives in a corresponding timeline[] and has an ancestor[].
@@ -29,24 +29,32 @@ Each Branch lives in a corresponding timeline[] and has an ancestor[].
NOTE: This is an overloaded term.
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
### Checkpoint (Layered repository)
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server []
writes out the changes from in-memory layers into new layer files[]. This process
is called "checkpointing". The page server only creates layer files for
relations that have been modified since the last checkpoint.
writes out the changes from the in-memory layer into a new delta layer file. This process
is called "checkpointing".
Configuration parameter `checkpoint_distance` defines the distance
from current LSN to perform checkpoint of in-memory layers.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
Set this parameter to `0` to force checkpoint of every layer.
Configuration parameter `checkpoint_period` defines the interval between checkpoint iterations.
Default is `DEFAULT_CHECKPOINT_PERIOD`.
### Compaction
A background operation on layer files. Compaction takes a number of L0
layer files, each of which covers the whole key space and a range of
LSN, and reshuffles the data in them into L1 files so that each file
covers the whole LSN range, but only part of the key space.
Compaction should also opportunistically leave obsolete page versions
out of the L1 files, and materialize other page versions for faster
access. That hasn't been implemented as of this writing, though.
### Compute node
Stateless Postgres node that stores data in pageserver.
@@ -54,10 +62,10 @@ Stateless Postgres node that stores data in pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
Each PostgreSQL fork is considered a separate relish.
### Layer
@@ -72,15 +80,15 @@ are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". Each file corresponds to one RELISH_SEG_SIZE
segment of a PostgreSQL relation fork. There are two kinds of layer
files: image files and delta files. An image file contains a
"snapshot" of the segment at a particular LSN, and a delta file
contains WAL records applicable to the segment, in a range of LSNs.
files are called "layer files". There are two kinds of layer files:
image files and delta files. An image file contains a "snapshot" of a
range of keys at a particular LSN, and a delta file contains WAL
records applicable to a range of keys, in a range of LSNs.
### Layer map
The layer map tracks what layers exist for all the relishes in a timeline.
The layer map tracks what layers exist in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
@@ -100,10 +108,10 @@ PostgreSQL LSNs and functions to monitor them:
* `pg_current_wal_lsn()` - Returns the current write-ahead log write location.
* `pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
* `pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
[source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html):
Zenith safekeeper LSNs. For more check [walkeeper/README_PROTO.md](/walkeeper/README_PROTO.md)
Zenith safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md)
* `CommitLSN`: position in WAL confirmed by quorum safekeepers.
* `RestartLSN`: position in WAL confirmed by all safekeepers.
* `FlushLSN`: part of WAL persisted to the disk by safekeeper.
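The difference between "confirmed by quorum" and "confirmed by all" can be illustrated with a small sketch. It assumes each safekeeper reports a single confirmed position as a `u64`; the function names are hypothetical and are not taken from the safekeeper code.

```rust
/// Highest position confirmed by a majority of safekeepers.
/// `confirmed` holds one reported position per safekeeper (assumption).
fn commit_lsn(confirmed: &[u64]) -> Option<u64> {
    if confirmed.is_empty() {
        return None;
    }
    let mut sorted = confirmed.to_vec();
    // Sort descending: the quorum-th largest value is confirmed by a majority.
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    let quorum = sorted.len() / 2 + 1;
    Some(sorted[quorum - 1])
}

/// Highest position confirmed by every safekeeper, i.e. the minimum.
fn restart_lsn(confirmed: &[u64]) -> Option<u64> {
    confirmed.iter().copied().min()
}
```

With three safekeepers at positions 300, 200 and 100, the quorum position is 200 and the position confirmed by all is 100.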
@@ -149,14 +157,6 @@ and create new databases and accounts (control plane API in our case).
The generic term in PostgreSQL for all objects in a database that have a name and a list of attributes defined in a specific order.
### Relish
We call each relation and other file that is stored in the
repository a "relish". It comes from "rel"-ish, as in "kind of a
rel", because it covers relations as well as other things that are
not relations, but are treated similarly for the purposes of the
storage layer.
### Replication slot
@@ -173,33 +173,24 @@ One repository corresponds to one Tenant.
How much history do we need to keep around for PITR and read-only nodes?
### Segment (PostgreSQL)
NOTE: This is an overloaded term.
### Segment
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### Segment (Layered Repository)
NOTE: This is an overloaded term.
Segment is a RELISH_SEG_SIZE slice of relish (identified by a SegmentTag).
### SLRU
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in zenith yet (pg_commit_ts).
Each SLRU segment is considered a separate relish[].
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
Wal redo[] activity, timelines[], layers[] are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper
One safekeeper
See `docs/multitenancy.md` for more.


@@ -12,7 +12,7 @@ Init empty pageserver using `initdb` in temporary directory.
`--storage_dest=FILE_PREFIX | S3_PREFIX |...` option defines object storage type, all other parameters are passed via env variables. Inspired by WAL-G style naming : https://wal-g.readthedocs.io/STORAGES/.
Save`storage_dest` and other parameters in config.
Save`storage_dest` and other parameters in config.
Push snapshots to `storage_dest` in background.
```
@@ -21,7 +21,7 @@ zenith start
```
#### 2. Restart pageserver (manually or crash-recovery).
Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`.
Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`.
Push snapshots to `storage_dest` in background.
```
@@ -32,7 +32,7 @@ zenith start
Start pageserver from existing snapshot.
Path to snapshot provided via `--snapshot_path=FILE_PREFIX | S3_PREFIX | ...`
Do not save `snapshot_path` and `snapshot_format` in config, as it is a one-time operation.
Save`storage_dest` parameters in config.
Save`storage_dest` parameters in config.
Push snapshots to `storage_dest` in background.
```
//I.e. we want to start zenith on top of existing $PGDATA and use s3 as a persistent storage.
@@ -42,15 +42,15 @@ zenith start
How to pass credentials needed for `snapshot_path`?
#### 4. Export.
Manually push snapshot to `snapshot_path` which differs from `storage_dest`
Manually push snapshot to `snapshot_path` which differs from `storage_dest`
Optionally set `snapshot_format`, which can be plain pgdata format or zenith format.
```
zenith export --snapshot_path=FILE_PREFIX --snapshot_format=pgdata
```
#### Notes and questions
- walkeeper s3_offload should use same (similar) syntax for storage. How to set it in UI?
- safekeeper s3_offload should use same (similar) syntax for storage. How to set it in UI?
- Why do we need `zenith init` as a separate command? Can't we init everything at first start?
- We can think of better names for all options.
- Export to plain postgres format will be useless, if we are not 100% compatible on page level.
I can recall at least one such difference - PD_WAL_LOGGED flag in pages.
I can recall at least one such difference - PD_WAL_LOGGED flag in pages.


@@ -0,0 +1,69 @@
# Safekeeper gossip
Extracted from this [PR](https://github.com/zenithdb/rfcs/pull/13)
## Motivation
In some situations, safekeeper (SK) needs coordination with other SK's that serve the same tenant:
1. WAL deletion. SK needs to know what WAL was already safely replicated to delete it. Now we keep WAL indefinitely.
2. Deciding on who is sending WAL to the pageserver. Now sending SK crash may lead to a livelock where nobody sends WAL to the pageserver.
3. To enable SK to SK direct recovery without involving the compute
## Summary
Compute node has connection strings to each safekeeper. During each compute->safekeeper connection establishment, the compute node should pass down all those connection strings to each safekeeper. With that info, safekeepers may establish Postgres connections to each other and periodically send ping messages with LSN payload.
## Components
safekeeper, compute, compute<->safekeeper protocol, possibly console (group SK addresses)
## Proposed implementation
Each safekeeper can periodically ping all its peers and share connectivity and liveness info. If no ping was received from a peer for, let's say, four ping periods, we may consider the sending safekeeper dead. That would mean one of the alive safekeepers should connect to the pageserver. One way to decide which one exactly: `make_connection = my_node_id == min(alive_nodes)`
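A minimal sketch of that liveness rule, under stated assumptions: node ids are plain `u64` values, the local node always counts itself as alive, and the `PeerLiveness` type and its method names are hypothetical rather than part of the safekeeper code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical node identifier; the RFC does not define its type.
type NodeId = u64;

/// Tracks when a ping was last received from each peer safekeeper.
struct PeerLiveness {
    my_node_id: NodeId,
    ping_period: Duration,
    last_ping: HashMap<NodeId, Instant>,
}

impl PeerLiveness {
    fn new(my_node_id: NodeId, ping_period: Duration) -> Self {
        Self { my_node_id, ping_period, last_ping: HashMap::new() }
    }

    /// Remember that a ping from `from` arrived at time `at`.
    fn record_ping(&mut self, from: NodeId, at: Instant) {
        self.last_ping.insert(from, at);
    }

    /// Peers heard from within the last four ping periods, plus this node.
    fn alive_nodes(&self, now: Instant) -> Vec<NodeId> {
        let timeout = self.ping_period * 4;
        let mut alive: Vec<NodeId> = self
            .last_ping
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) <= timeout)
            .map(|(id, _)| *id)
            .collect();
        alive.push(self.my_node_id);
        alive
    }

    /// make_connection = my_node_id == min(alive_nodes)
    fn should_connect_to_pageserver(&self, now: Instant) -> bool {
        self.alive_nodes(now).iter().min() == Some(&self.my_node_id)
    }
}
```

Passing `now` explicitly keeps the rule deterministic and easy to test; a real implementation would also have to account for the corner cases listed below.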
Since safekeepers are multi-tenant, we may establish either per-tenant physical connections or per-safekeeper ones. So it makes sense to group "logical" connections between corresponding tenants on different nodes into a single physical connection. That means that we should implement an interconnect thread that maintains physical connections and periodically broadcasts info about all tenants.
Right now console may assign any 3 SK addresses to a given compute node. That may lead to a high number of gossip connections between SK's. Instead, we can assign safekeeper triples to the compute node. But if we want to "break"/"change" a group by an ad-hoc action, we can do it.
### Corner cases
- Current safekeeper may be alive but may not have connectivity to the pageserver
To address that, we need to gossip visibility info. Based on that info, we may define SK as alive only when it can connect to the pageserver.
- Current safekeeper may be alive but may not have connectivity with the compute node.
We may broadcast last_received_lsn and presence of compute connection and decide who is alive based on that.
- It is tricky to decide when to shut down gossip connections because we need to be sure that pageserver got all the committed (in the distributed sense, so local SK info is not enough) records, and it may never lose them. It is not a strict requirement since `--sync-safekeepers` that happens before the compute start will allow the pageserver to consume missing WAL, but it is better to do that in the background. So the condition may look like this: `majority_max(flush_lsn) == pageserver_s3_lsn` Here we rely on two facts:
- that `--sync-safekeepers` happened after the compute shutdown, and it advanced local commit_lsn's allowing pageserver to consume that WAL.
- we wait for the `pageserver_s3_lsn` advancement to avoid pageserver's last_received_lsn/disk_consistent_lsn going backward due to the disk/hardware failure and subsequent S3 recovery
If those conditions are not met, we will have some gossip activity (but that may be okay).
## Pros/cons
Pros:
- distributed, does not introduce new services (like etcd), does not add console as a storage dependency
- lays the foundation for gossip-based recovery
Cons:
- Only compute knows the set of safekeepers, but they should communicate even without a compute node. In case of a safekeeper restart, we will lose that info and can't gossip anymore. Hence we can't trim some WAL tail until the compute node starts. Also, it is ugly.
- If the console assigns a random set of safekeepers to each Postgres, we may end up in a situation where each safekeeper needs to have a connection with all other safekeepers. We can group safekeepers into isolated triples in the console to avoid that. Then "mixing" would happen only if we do rebalancing.
## Alternative implementation
We can have a selected node (e.g., console) with everybody reporting to it.
## Security implications
We don't increase the attack surface here. Communication can happen in a private network that is not exposed to users.
## Scalability implications
The only thing that may grow as we grow the number of computes is the number of gossip connections. But if we group safekeepers and assign a compute node to the random SK triple, the number of connections would be constant.


@@ -0,0 +1,145 @@
# Why LSM trees?
In general, an LSM tree has the nice property that random updates are
fast, but the disk writes are sequential. When a new file is created,
it is immutable. New files are created and old ones are deleted, but
existing files are never modified. That fits well with storing the
files on S3.
Currently, we create a lot of small files. That is mostly a problem
with S3, because each GET/PUT operation is expensive, and LIST
operation only returns 1000 objects at a time, and isn't free
either. Currently, the files are "archived" together into larger
checkpoint files before they're uploaded to S3 to alleviate that
problem, but garbage collecting data from the archive files would be
difficult and we have not implemented it. This proposal addresses that
problem.
# Overview
```
^ LSN
|
| Memtable: +-----------------------------+
| | |
| +-----------------------------+
|
|
| L0: +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
|
| L1: +-------+ +-----+ +--+ +-+
| | | | | | | | |
| | | | | | | | |
| +-------+ +-----+ +--+ +-+
|
| +----+ +-----+ +--+ +----+
| | | | | | | | |
| | | | | | | | |
| +----+ +-----+ +--+ +----+
|
+--------------------------------------------------------------> Page ID
+---+
| | Layer file
+---+
```
# Memtable
When new WAL arrives, it is first put into the Memtable. Despite the
name, the Memtable is not a purely in-memory data structure. It can
spill to a temporary file on disk if the system is low on memory, and
is accessed through a buffer cache.
If the page server crashes, the Memtable is lost. It is rebuilt by
processing again the WAL that's newer than the latest layer in L0.
The size of the Memtable is configured by the "checkpoint distance"
setting. Because anything that hasn't been flushed to disk and
uploaded to S3 yet needs to be kept in the safekeeper, the "checkpoint
distance" also determines the amount of WAL that needs to be kept in the
safekeeper.
# L0
When the Memtable fills up, it is written out to a new file in L0. The
files are immutable; when a file is created, it is never
modified. Each file in L0 is roughly 1 GB in size (*). Like the
Memtable, each file in L0 covers the whole key range.
When enough files have been accumulated in L0, compaction
starts. Compaction processes all the files in L0 and reshuffles the
data to create a new set of files in L1.
(*) except in corner cases like if we want to shut down the page
server and want to flush out the memtable to disk even though it's not
full yet.
# L1
L1 consists of ~ 1 GB files like L0. But each file covers only part of
the overall key space, and a larger range of LSNs. This speeds up
searches. When you're looking for a given page, you need to check all
the files in L0, to see if they contain a page version for the requested
page. But in L1, you only need to check the files whose key range covers
the requested page. This is particularly important at cold start, when
checking a file means downloading it from S3.
Partitioning by key range also helps with garbage collection. If only a
part of the database is updated, we will accumulate more files for
the hot part in L1, and old files can be removed without affecting the
cold part.
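The difference in search cost between L0 and L1 can be sketched as follows. This is only an illustration: the `LayerFile` struct, the `u64` key type, and `files_to_check` are hypothetical names, and LSN filtering is left out to keep the focus on the key range.

```rust
use std::ops::Range;

/// Hypothetical page key; the proposal only speaks of a "Page ID" axis.
type Key = u64;

/// A layer file reduced to what the key-range check needs.
struct LayerFile {
    name: &'static str,
    key_range: Range<Key>,
}

/// Files a lookup of `key` has to visit: every L0 file, because each one
/// covers the whole key space, but only the L1 files whose key range
/// contains the requested key.
fn files_to_check<'a>(
    l0: &'a [LayerFile],
    l1: &'a [LayerFile],
    key: Key,
) -> Vec<&'a LayerFile> {
    let mut out: Vec<&LayerFile> = l0.iter().collect();
    out.extend(l1.iter().filter(|f| f.key_range.contains(&key)));
    out
}
```

At cold start each returned file may mean one download from S3, which is why narrowing the L1 set matters.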
# Image layers
So far, we've only talked about delta layers. In addition to the delta
layers, we create image layers, when "enough" WAL has been accumulated
for some part of the database. Each image layer covers a 1 GB range of
key space. It contains images of the pages at a single LSN, a snapshot
if you will.
The exact heuristic for what "enough" means is not clear yet. Maybe
create a new image layer when 10 GB of WAL has been accumulated for a
1 GB segment.
The image layers limit the number of layers that a search needs to
check. That puts a cap on read latency, and it also allows garbage
collecting layers that are older than the GC horizon.
# Partitioning scheme
When compaction happens and creates a new set of files in L1, how do
we partition the data into the files?
- Goal is that each file is ~ 1 GB in size
- Try to match partition boundaries at relation boundaries. (See [1]
for how PebblesDB does this, and for why that's important)
- Greedy algorithm
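One possible reading of the goals above, as a sketch rather than the actual algorithm: walk the relations in key order and start a new file whenever adding the next relation would exceed the target size, so that cuts fall only on relation boundaries. `RelationSize` and `partition_greedy` are hypothetical names, and a relation larger than the target simply gets a file of its own here.

```rust
/// Size of one relation's data in the compaction input (hypothetical type).
struct RelationSize {
    rel_id: u32,
    bytes: u64,
}

/// Greedily groups relations, given in key order, into partitions of at
/// most `target` bytes, cutting only at relation boundaries. A relation
/// that is larger than `target` ends up alone in an oversized partition.
fn partition_greedy(rels: &[RelationSize], target: u64) -> Vec<Vec<u32>> {
    let mut partitions = Vec::new();
    let mut current: Vec<u32> = Vec::new();
    let mut current_bytes = 0u64;
    for rel in rels {
        // Close the current partition if this relation would overflow it.
        if !current.is_empty() && current_bytes + rel.bytes > target {
            partitions.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(rel.rel_id);
        current_bytes += rel.bytes;
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}
```

For the roughly 1 GB goal, `target` would be `1 << 30`. Splitting an oversized relation into several files is not covered by this sketch.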
# Additional Reading
[1] Paper on PebblesDB and how it does partitioning.
https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf


@@ -0,0 +1,295 @@
# Storage messaging
Created on 19.01.22
Initially created [here](https://github.com/zenithdb/rfcs/pull/16) by @kelvich.
It is an alternative to 014-safekeeper-gossip.
## Motivation
As in 014-safekeeper-gossip we need to solve the following problems:
* Trim WAL on safekeepers
* Decide on which SK should push WAL to the S3
* Decide on which SK should forward WAL to the pageserver
* Decide on when to shut down SK<->pageserver connection
This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery but rather unties two sets of different issues we previously wanted to solve with gossip.
Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper.
## Summary
Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a `--broker-url=1.2.3.4` CLI param.
Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a gRPC client to our codebase, either directly or as an etcd dependency.
## Non-goals
This RFC does *not* suggest moving the compute to pageserver and compute to safekeeper mappings out of the console. The console is still the only place in the cluster responsible for the persistency of that info. So I'm implying that each pageserver and safekeeper knows exactly which timelines it serves, as is currently the case. We need some mechanism for a new pageserver to discover mapping info, but that is out of the scope of this RFC.
## Impacted components
pageserver, safekeeper
adds either etcd or console as a storage dependency
## Possible implementation: custom message broker in the console
We've decided to go with an etcd approach instead of the message broker.
<details closed>
<summary>Original suggestion</summary>
<br>
We can add a Grpc service in the console that acts as a message broker since the console knows the addresses of all the components. The broker can ignore the payload and only redirect messages. So, for example, each safekeeper may send a message to the peering safekeepers or to the pageserver responsible for a given timeline.
Message format could be `{sender, destination, payload}`.
The destination is either:
1. `sk_#{tenant}_#{timeline}` -- to be broadcasted on all safekeepers, responsible for that timeline, or
2. `pserver_#{tenant}_#{timeline}` -- to be broadcasted on all pageservers, responsible for that timeline
Sender is either:
1. `sk_#{sk_id}`, or
2. `pserver_#{pserver_id}`
I can think of the following behavior to address our original problems:
* WAL trimming
Each safekeeper periodically broadcasts `(write_lsn, commit_lsn)` to all peering (peering == responsible for that timeline) safekeepers
* Decide on which SK should push WAL to the S3
Each safekeeper periodically broadcasts `i_am_alive_#{current_timestamp}` message to all peering safekeepers. That way, safekeepers may maintain the vector of alive peers (loose one, with false negatives). Alive safekeeper with the minimal id pushes data to S3.
* Decide on which SK should forward WAL to the pageserver
Each safekeeper periodically sends (write_lsn, commit_lsn, compute_connected) to the relevant pageservers. With that info, pageserver can maintain a view of the safekeepers state, connect to a random one, and detect the moments (e.g., one of the safekeepers is not making progress or is down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `#sk_#{sk_id}` to `4.5.6.7:6400`.
Pageserver connection to the safekeeper triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other addresses until the next compute connection).
* Decide on when to shutdown sk<->pageserver connection
Again, pageserver would have all the info to understand when to shut down the safekeeper connection.
### Scalability
One node is enough (c) No, seriously, it is enough.
### High Availability
Broker lives in the console, so we can rely on k8s maintaining the console app alive.
If the console is down, we won't trim WAL or reconnect the pageserver to another safekeeper. But, at the same time, if the console is down, we already can't accept new compute connections or start stopped computes, so we are making things a bit worse, but not dramatically.
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | Console broker | ...
sk_n <-> |________________| <-> pserver_m
```
</details>
## Implementation: etcd state store
Alternatively, we can set up `etcd` and maintain the following data structure in it:
```ruby
"compute_#{tenant}_#{timeline}" => {
safekeepers => {
"sk_#{sk_id}" => {
write_lsn: "0/AEDF130",
commit_lsn: "0/AEDF100",
compute_connected: true,
last_updated: 1642621138,
},
}
}
```
As etcd doesn't support field updates in the nested objects that translates to the following set of keys:
```ruby
"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/write_lsn",
"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/commit_lsn",
...
```
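The flattening can be sketched as a function that turns one safekeeper's state into key/value pairs. This is an illustration only: `SafekeeperState` and `flatten` are hypothetical names, values are stored as plain strings by assumption, and no etcd client is involved.

```rust
/// One safekeeper's per-timeline state, mirroring the fields shown above.
struct SafekeeperState {
    write_lsn: String,
    commit_lsn: String,
    compute_connected: bool,
    last_updated: u64,
}

/// Key prefix shared by everything stored for one timeline.
fn timeline_prefix(tenant: &str, timeline: &str) -> String {
    format!("compute_{}_{}", tenant, timeline)
}

/// Expands the nested structure into one flat key per field.
fn flatten(
    tenant: &str,
    timeline: &str,
    sk_id: u64,
    state: &SafekeeperState,
) -> Vec<(String, String)> {
    let base = format!("{}/safekeepers/sk_{}", timeline_prefix(tenant, timeline), sk_id);
    vec![
        (format!("{}/write_lsn", base), state.write_lsn.clone()),
        (format!("{}/commit_lsn", base), state.commit_lsn.clone()),
        (format!("{}/compute_connected", base), state.compute_connected.to_string()),
        (format!("{}/last_updated", base), state.last_updated.to_string()),
    ]
}
```

Because all keys of a timeline share `timeline_prefix`, a subscriber can watch that single prefix to receive every field update.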
Each storage node can subscribe to the relevant sets of keys and maintain a local view of that structure. So in terms of the data flow, everything is the same as in the previous approach. Still, we can avoid implementing the message broker and prevent runtime storage dependency on a console.
### Safekeeper address discovery
During startup, the safekeeper should publish the address it is listening on as part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. Safekeeper should have an `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertise something more useful.
### Safekeeper behavior
For each timeline, the safekeeper periodically broadcasts `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way the safekeeper will have information about peering safekeepers.
That amount of information is enough to properly trim WAL. To decide on who is pushing the data to S3 safekeeper may use etcd leases or broadcast a timestamp and hence track who is alive.
### Pageserver behavior
Pageserver subscribes to `compute_#{tenant}_#{timeline}` for each tenant it owns. With that info, pageserver can maintain a view of the safekeepers state, connect to a random one, and detect the moments (e.g., one of the safekeepers is not making progress or is down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `#sk_#{sk_id}` to `4.5.6.7:6400`.
Pageserver connection to the safekeeper can be triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
As an alternative to compute_connected, we can track the timestamp of the latest message that arrived at the safekeeper from compute. Usually compute broadcasts KeepAlive to all safekeepers every second, so it will be updated every second when the connection is ok. The connection can then be considered down when this timestamp hasn't been updated for several seconds.
This helps to detect issues with a safekeeper faster (and switch to another one) in the following cases:
1. when compute failed but the TCP connection stays alive until timeout (usually about a minute)
2. when safekeeper failed and didn't set compute_connected to false
Another way to deal with [2] is to process (write_lsn, commit_lsn, compute_connected) as a KeepAlive on the pageserver side and detect issues when sk_id doesn't send anything for some time. This way is fully compliant with this RFC.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other addresses until the next compute connection).
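A sketch of how a pageserver might use that view to pick a replication source. It assumes the pageserver keeps, per safekeeper, the last reported `commit_lsn` and a `last_updated` timestamp in seconds, treats a safekeeper as stale once it has been silent longer than a threshold, and prefers the most advanced fresh one (the RFC text also allows picking a random one). `SkView` and `pick_safekeeper` are hypothetical names; `parse_lsn` handles the usual `hi/lo` hexadecimal LSN text form.

```rust
/// Parses the textual LSN form "hi/lo" (two hexadecimal halves) into a u64.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Pageserver-side view of one safekeeper (hypothetical type).
struct SkView {
    sk_id: u64,
    commit_lsn: u64,
    /// Unix timestamp, in seconds, of the last update seen from it.
    last_updated: u64,
}

/// Picks the most advanced safekeeper among those that reported recently.
/// Returns `None` when every safekeeper looks stale.
fn pick_safekeeper(views: &[SkView], now: u64, stale_after_secs: u64) -> Option<u64> {
    views
        .iter()
        .filter(|v| now.saturating_sub(v.last_updated) <= stale_after_secs)
        .max_by_key(|v| v.commit_lsn)
        .map(|v| v.sk_id)
}
```

Re-running `pick_safekeeper` whenever the subscribed keys change gives the pageserver a simple trigger for switching to another safekeeper.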
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | etcd | ...
sk_n <-> |________________| <-> pserver_m
```
### Sequence diagrams for different workflows
#### Cluster startup
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
PS1->>M: subscribe to updates to state of timeline N
C->>+SK1: WAL push
loop constantly update current lsns
SK1->>-M: I'm at lsn A
end
C->>+SK2: WAL push
loop constantly update current lsns
SK2->>-M: I'm at lsn B
end
C->>+SK3: WAL push
loop constantly update current lsns
SK3->>-M: I'm at lsn C
end
loop request pages
C->>+PS1: get_page@lsn
PS1->>-C: page image
end
M->>PS1: New compute appeared for timeline N. SK1 at A, SK2 at B, SK3 at C
note over PS1: Say SK1 at A=200, SK2 at B=150 SK3 at C=100 <br> so connect to SK1 because it is the most up to date one
PS1->>SK1: start replication
```
#### Behaviour of services during typical operations
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Scenario 1: Pageserver checkpoint
note over PS1: Upload data to S3
PS1->>M: Update remote consistent lsn
M->>SK1: propagate remote consistent lsn update
note over SK1: truncate WAL up to remote consistent lsn
M->>SK2: propagate remote consistent lsn update
note over SK2: truncate WAL up to remote consistent lsn
M->>SK3: propagate remote consistent lsn update
note over SK3: truncate WAL up to remote consistent lsn
note over C,M: Scenario 2: SK1 finds itself lagging behind MAX(150 (SK2), 200 (SK2)) - 100 (SK1) > THRESHOLD
SK1->>SK2: Fetch WAL delta between 100 (SK1) and 200 (SK2)
note over C,M: Scenario 3: PS1 detects that SK1 is lagging behind: Connection from SK1 is broken or there are no messages from it in 30 seconds.
note over PS1: e.g. SK2 is at 150, SK3 is at 100, choose SK2 as a new replication source
PS1->>SK2: start replication
```
#### Behaviour during timeline relocation
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Timeline is being relocated from PS1 to PS2
O->>+PS2: Attach timeline
PS2->>-O: 202 Accepted if timeline exists in S3
note over PS2: Download timeline from S3
note over O: Poll for timeline download (or subscribe to metadata service)
loop wait for attach to complete
O->>PS2: timeline detail should answer that timeline is ready
end
PS2->>M: Register downloaded timeline
PS2->>M: Get safekeepers for timeline, subscribe to changes
PS2->>SK1: Start replication to catch up
note over O: PS2 caught up, time to switch compute
O->>C: Restart compute with new pageserver url in config
note over C: Wal push is restarted
loop request pages
C->>+PS2: get_page@lsn
PS2->>-C: page image
end
O->>PS1: detach timeline
note over C,M: Scenario 1: Attach call failed
O--xPS2: Attach timeline
note over O: The operation can be safely retried, <br> if we hit some threshold we can try another pageserver
note over C,M: Scenario 2: Attach succeeded but pageserver failed to download the data or start replication
loop wait for attach to complete
O--xPS2: timeline detail should answer that timeline is ready
end
note over O: Can wait for a timeout, and then try another pageserver <br> there should be a limit on number of different pageservers to try
note over C,M: Scenario 3: Detach fails
O--xPS1: Detach timeline
note over O: can be retried, if continues to fail might lead to data duplication in s3
```
# Pros/cons
## Console broker/etcd vs gossip:
Gossip pros:
* gossip allows running storage without the console or etcd
Console broker/etcd pros:
* simpler
* solves "call me maybe" as well
* avoid possible N-to-N connection issues with gossip without grouping safekeepers in pre-defined triples
## Console broker vs. etcd:
Initially, I wanted to avoid etcd as a dependency, mostly because I've seen how painful the ZooKeeper dependency was for ClickHouse: in each chat, at each conference, people were complaining about configuration and maintenance barriers with ZooKeeper. It was so bad that ClickHouse re-implemented ZooKeeper to embed it: https://clickhouse.com/docs/en/operations/clickhouse-keeper/.
But with etcd we are in a slightly different situation:
1. We don't need persistency and strong consistency guarantees for the data we store in etcd
2. etcd uses gRPC as a protocol, and messages are pretty simple
So it looks like implementing an in-memory store with the etcd interface would be straightforward _if we want that in the future_. At the same time, we can avoid implementing it right now, and we will be able to run a local zenith installation with etcd running somewhere in the background (as opposed to building and running the console, which in turn requires Postgres).


@@ -0,0 +1,79 @@
Cluster size limits
==================
## Summary
One of the resource consumption limits for free-tier users is a cluster size limit.
To enforce it, we need to calculate the timeline size and check if the limit is reached before relation create/extend operations.
If the limit is reached, the query must fail with some meaningful error/warning.
We may want to exempt some operations from the quota to allow users free space to fit back into the limit.
The stateless compute node that performs validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components.
## Motivation
Limit the maximum size of a PostgreSQL instance to limit free tier users (and other tiers in the future).
First of all, this is needed to control our free tier production costs.
Another reason to limit resources is risk management — we haven't (fully) tested and optimized zenith for big clusters,
so we don't want to give users access to the functionality that we don't think is ready.
## Components
* pageserver - calculate the size consumed by a timeline and add it to the feedback message.
* safekeeper - pass feedback message from pageserver to compute.
* compute - receive feedback message, enforce size limit based on GUC `zenith.max_cluster_size`.
* console - set and update `zenith.max_cluster_size` setting
## Proposed implementation
First of all, it's necessary to define timeline size.
The current approach is to count all data, including SLRUs but not WAL.
Here we think of it as a physical disk underneath the Postgres cluster.
This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver.
Alternatively, we could count only relation data, as in `pg_database_size()`.
This approach is somewhat more user-friendly because it counts only the data that the user actually affects.
On the other hand, it puts us in a weaker position than other services, e.g., RDS.
We will need to refactor the timeline_size counter or add another counter to implement it.
Timeline size is updated during WAL digestion. It is not versioned and is valid as of `last_received_lsn`.
This size should then be reported to the compute node.
The `current_timeline_size` value is included in the walreceiver's custom feedback message, `ZenithFeedback`
(PR about protocol changes https://github.com/zenithdb/zenith/pull/1037).
This message is received by the safekeeper and propagated to the compute node as a part of `AppendResponse`.
Finally, when the compute node receives `current_timeline_size` from the safekeeper (or from the pageserver directly), it updates a global variable.
Every `zenith_extend()` operation then checks whether the limit is reached `(current_timeline_size > zenith.max_cluster_size)` and throws an `ERRCODE_DISK_FULL` error if so.
(see Postgres error codes [https://www.postgresql.org/docs/devel/errcodes-appendix.html](https://www.postgresql.org/docs/devel/errcodes-appendix.html))
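The check described above can be sketched as follows. This is only an illustration in Rust, not the actual compute-side code: `ExtendCheck`, `check_extend`, and the use of `None` for "no limit configured" are hypothetical names and assumptions of this sketch, and the autovacuum bypass reflects the open TODO item rather than settled behavior.

```rust
/// Outcome of the size check performed before a relation is extended.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum ExtendCheck {
    /// The extend operation may proceed.
    Allowed,
    /// Stands in for raising ERRCODE_DISK_FULL in the compute node.
    DiskFull { current: u64, limit: u64 },
}

/// Hypothetical helper mirroring the check from the RFC text.
///
/// * `current_timeline_size` is the last value received in the feedback message.
/// * `max_cluster_size` stands in for the `zenith.max_cluster_size` GUC;
///   `None` meaning "no limit" is an assumption of this sketch.
/// * `is_autovacuum_worker` models the proposed autovacuum bypass.
fn check_extend(
    current_timeline_size: u64,
    max_cluster_size: Option<u64>,
    is_autovacuum_worker: bool,
) -> ExtendCheck {
    // Proposed exemption: autovacuum workers skip the check entirely.
    if is_autovacuum_worker {
        return ExtendCheck::Allowed;
    }
    match max_cluster_size {
        // Same comparison as in the text: strictly greater than the limit.
        Some(limit) if current_timeline_size > limit => ExtendCheck::DiskFull {
            current: current_timeline_size,
            limit,
        },
        _ => ExtendCheck::Allowed,
    }
}
```

Note that the comparison is strict, so a timeline that sits exactly at the limit can still be extended once; whether that is desirable is a design choice the RFC leaves open.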
TODO:
We can allow autovacuum processes to bypass this check by simply checking `IsAutoVacuumWorkerProcess()`.
It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check, but it's not easy to distinguish these operations at the low level.
See issues https://github.com/neondatabase/neon/issues/1245
https://github.com/zenithdb/zenith/issues/1445
TODO:
We should warn users if the limit is soon to be reached.
### **Reliability, failure modes and corner cases**
1. `current_timeline_size` is valid as of the last LSN received and digested by the pageserver.
If the pageserver lags behind the compute node, `current_timeline_size` will lag too. This lag can be tuned using backpressure, but it is not expected to be 0 all the time.
So transactions that happen in this LSN range may cause the limit to be exceeded, especially operations that generate (e.g., CREATE DATABASE) or free (e.g., TRUNCATE) a lot of data pages while generating a small amount of WAL. Are there other operations like this?
Currently, CREATE DATABASE operations are restricted in the console. So this is not an issue.
### **Security implications**
We treat compute as an untrusted component. That's why we try to isolate it with a secure container runtime or a VM.
Malicious users may change `zenith.max_cluster_size`, so we need an extra size limit check.
To cover this case, we also monitor the compute node size in the console.


@@ -68,11 +68,11 @@ S3.
The unit is # of bytes.
#### checkpoint_period
#### compaction_period
The pageserver checks whether `checkpoint_distance` has been reached
every `checkpoint_period` seconds. Default is 1 s, which should be
fine.
Every `compaction_period` seconds, the page server checks if
maintenance operations, like compaction, are needed on the layer
files. Default is 1 s, which should be fine.
#### gc_horizon


@@ -57,16 +57,18 @@ PostgreSQL extension that implements storage manager API and network communicati
PostgreSQL extension that contains functions needed for testing and debugging.
`/walkeeper`:
`/safekeeper`:
The zenith WAL service that receives WAL from a primary compute node and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see `/walkeeper/README`
For more detailed info, see `/safekeeper/README`
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation.
`/zenith`
Main entry point for the 'zenith' CLI utility.


@@ -4,19 +4,20 @@ version = "0.1.0"
edition = "2021"
[dependencies]
bookfile = { git = "https://github.com/zenithdb/bookfile.git", branch="generic-readext" }
chrono = "0.4.19"
rand = "0.8.3"
regex = "1.4.5"
bytes = { version = "1.0.1", features = ['serde'] }
byteorder = "1.4.3"
futures = "0.3.13"
hex = "0.4.3"
hyper = "0.14"
itertools = "0.10.3"
lazy_static = "1.4.0"
log = "0.4.14"
clap = "3.0"
daemonize = "0.4.1"
tokio = { version = "1.11", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-util = { version = "0.7", features = ["io"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
@@ -25,17 +26,16 @@ tokio-stream = "0.1.8"
anyhow = { version = "1.0", features = ["backtrace"] }
crc32c = "0.6.0"
thiserror = "1.0"
hex = { version = "0.4.3", features = ["serde"] }
tar = "0.4.33"
humantime = "2.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "1.12.0"
toml_edit = { version = "0.13", features = ["easy"] }
scopeguard = "1.1.0"
async-trait = "0.1"
const_format = "0.2.21"
tracing = "0.1.27"
tracing-futures = "0.2"
signal-hook = "0.3.10"
url = "2"
nix = "0.23"
@@ -43,13 +43,15 @@ once_cell = "1.8.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] }
rusoto_core = "0.47"
rusoto_s3 = "0.47"
async-trait = "0.1"
async-compression = {version = "0.3", features = ["zstd", "tokio"]}
postgres_ffi = { path = "../postgres_ffi" }
zenith_metrics = { path = "../zenith_metrics" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
[dev-dependencies]
hex-literal = "0.3"


@@ -13,7 +13,7 @@ keeps track of WAL records which are not synced to S3 yet.
The Page Server consists of multiple threads that operate on a shared
repository of page versions:
```
| WAL
V
+--------------+
@@ -46,7 +46,7 @@ Legend:
---> Data flow
<---
```
Page Service
------------


@@ -10,18 +10,19 @@
//! This module is responsible for creation of such tarball
//! from data stored in object storage.
//!
use anyhow::{Context, Result};
use anyhow::{ensure, Context, Result};
use bytes::{BufMut, BytesMut};
use log::*;
use std::fmt::Write as FmtWrite;
use std::io;
use std::io::Write;
use std::sync::Arc;
use std::time::SystemTime;
use tar::{Builder, EntryType, Header};
use tracing::*;
use crate::relish::*;
use crate::reltag::SlruKind;
use crate::repository::Timeline;
use crate::DatadirTimelineImpl;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::*;
use zenith_utils::lsn::Lsn;
@@ -31,7 +32,7 @@ use zenith_utils::lsn::Lsn;
/// used for constructing tarball.
pub struct Basebackup<'a> {
ar: Builder<&'a mut dyn Write>,
timeline: &'a Arc<dyn Timeline>,
timeline: &'a Arc<DatadirTimelineImpl>,
pub lsn: Lsn,
prev_record_lsn: Lsn,
}
@@ -46,7 +47,7 @@ pub struct Basebackup<'a> {
impl<'a> Basebackup<'a> {
pub fn new(
write: &'a mut dyn Write,
timeline: &'a Arc<dyn Timeline>,
timeline: &'a Arc<DatadirTimelineImpl>,
req_lsn: Option<Lsn>,
) -> Result<Basebackup<'a>> {
// Compute postgres doesn't have any previous WAL files, but the first
@@ -64,13 +65,14 @@ impl<'a> Basebackup<'a> {
// prev_lsn to Lsn(0) if we cannot provide the correct value.
let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn {
// Backup was requested at a particular LSN. Wait for it to arrive.
timeline.wait_lsn(req_lsn)?;
info!("waiting for {}", req_lsn);
timeline.tline.wait_lsn(req_lsn)?;
// If the requested point is the end of the timeline, we can
// provide prev_lsn. (get_last_record_rlsn() might return it as
// zero, though, if no WAL has been generated on this timeline
// yet.)
let end_of_timeline = timeline.get_last_record_rlsn();
let end_of_timeline = timeline.tline.get_last_record_rlsn();
if req_lsn == end_of_timeline.last {
(end_of_timeline.prev, req_lsn)
} else {
@@ -78,7 +80,7 @@ impl<'a> Basebackup<'a> {
}
} else {
// Backup was requested at end of the timeline.
let end_of_timeline = timeline.get_last_record_rlsn();
let end_of_timeline = timeline.tline.get_last_record_rlsn();
(end_of_timeline.prev, end_of_timeline.last)
};
@@ -115,21 +117,24 @@ impl<'a> Basebackup<'a> {
}
// Gather non-relational files from object storage pages.
for obj in self.timeline.list_nonrels(self.lsn)? {
match obj {
RelishTag::Slru { slru, segno } => {
self.add_slru_segment(slru, segno)?;
}
RelishTag::FileNodeMap { spcnode, dbnode } => {
self.add_relmap_file(spcnode, dbnode)?;
}
RelishTag::TwoPhase { xid } => {
self.add_twophase_file(xid)?;
}
_ => {}
for kind in [
SlruKind::Clog,
SlruKind::MultiXactOffsets,
SlruKind::MultiXactMembers,
] {
for segno in self.timeline.list_slru_segments(kind, self.lsn)? {
self.add_slru_segment(kind, segno)?;
}
}
// Create tablespace directories
for ((spcnode, dbnode), has_relmap_file) in self.timeline.list_dbdirs(self.lsn)? {
self.add_dbdir(spcnode, dbnode, has_relmap_file)?;
}
for xid in self.timeline.list_twophase_files(self.lsn)? {
self.add_twophase_file(xid)?;
}
// Generate pg_control and bootstrap WAL segment.
self.add_pgcontrol_file()?;
self.ar.finish()?;
@@ -141,28 +146,15 @@ impl<'a> Basebackup<'a> {
// Generate SLRU segment files from repository.
//
fn add_slru_segment(&mut self, slru: SlruKind, segno: u32) -> anyhow::Result<()> {
let seg_size = self
.timeline
.get_relish_size(RelishTag::Slru { slru, segno }, self.lsn)?;
if seg_size == None {
trace!(
"SLRU segment {}/{:>04X} was truncated",
slru.to_str(),
segno
);
return Ok(());
}
let nblocks = seg_size.unwrap();
let nblocks = self.timeline.get_slru_segment_size(slru, segno, self.lsn)?;
let mut slru_buf: Vec<u8> =
Vec::with_capacity(nblocks as usize * pg_constants::BLCKSZ as usize);
for blknum in 0..nblocks {
let img =
self.timeline
.get_page_at_lsn(RelishTag::Slru { slru, segno }, blknum, self.lsn)?;
assert!(img.len() == pg_constants::BLCKSZ as usize);
let img = self
.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn)?;
ensure!(img.len() == pg_constants::BLCKSZ as usize);
slru_buf.extend_from_slice(&img);
}
@@ -176,16 +168,26 @@ impl<'a> Basebackup<'a> {
}
//
// Extract pg_filenode.map files from repository
// Along with them also send PG_VERSION for each database.
// Include database/tablespace directories.
//
fn add_relmap_file(&mut self, spcnode: u32, dbnode: u32) -> anyhow::Result<()> {
let img = self.timeline.get_page_at_lsn(
RelishTag::FileNodeMap { spcnode, dbnode },
0,
self.lsn,
)?;
let path = if spcnode == pg_constants::GLOBALTABLESPACE_OID {
// Each directory contains a PG_VERSION file, and the default database
// directories also contain pg_filenode.map files.
//
fn add_dbdir(
&mut self,
spcnode: u32,
dbnode: u32,
has_relmap_file: bool,
) -> anyhow::Result<()> {
let relmap_img = if has_relmap_file {
let img = self.timeline.get_relmap_file(spcnode, dbnode, self.lsn)?;
ensure!(img.len() == 512);
Some(img)
} else {
None
};
if spcnode == pg_constants::GLOBALTABLESPACE_OID {
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header("PG_VERSION", version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
@@ -193,26 +195,51 @@ impl<'a> Basebackup<'a> {
let header = new_tar_header("global/PG_VERSION", version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
String::from("global/pg_filenode.map") // filenode map for global tablespace
if let Some(img) = relmap_img {
// filenode map for global tablespace
let header = new_tar_header("global/pg_filenode.map", img.len() as u64)?;
self.ar.append(&header, &img[..])?;
} else {
warn!("global/pg_filenode.map is missing");
}
} else {
// User defined tablespaces are not supported. However, as
// a special case, if a tablespace/db directory is
// completely empty, we can leave it out altogether. This
// makes taking a base backup after the 'tablespace'
// regression test pass, because the test drops the
// created tablespaces after the tests.
//
// FIXME: this wouldn't be necessary, if we handled
// XLOG_TBLSPC_DROP records. But we probably should just
// throw an error on CREATE TABLESPACE in the first place.
if !has_relmap_file
&& self
.timeline
.list_rels(spcnode, dbnode, self.lsn)?
.is_empty()
{
return Ok(());
}
// User defined tablespaces are not supported
assert!(spcnode == pg_constants::DEFAULTTABLESPACE_OID);
ensure!(spcnode == pg_constants::DEFAULTTABLESPACE_OID);
// Append dir path for each database
let path = format!("base/{}", dbnode);
let header = new_tar_header_dir(&path)?;
self.ar.append(&header, &mut io::empty())?;
let dst_path = format!("base/{}/PG_VERSION", dbnode);
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header(&dst_path, version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
if let Some(img) = relmap_img {
let dst_path = format!("base/{}/PG_VERSION", dbnode);
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header(&dst_path, version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
format!("base/{}/pg_filenode.map", dbnode)
let relmap_path = format!("base/{}/pg_filenode.map", dbnode);
let header = new_tar_header(&relmap_path, img.len() as u64)?;
self.ar.append(&header, &img[..])?;
}
};
assert!(img.len() == 512);
let header = new_tar_header(&path, img.len() as u64)?;
self.ar.append(&header, &img[..])?;
Ok(())
}
@@ -220,9 +247,7 @@ impl<'a> Basebackup<'a> {
// Extract twophase state files
//
fn add_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
let img = self
.timeline
.get_page_at_lsn(RelishTag::TwoPhase { xid }, 0, self.lsn)?;
let img = self.timeline.get_twophase_file(xid, self.lsn)?;
let mut buf = BytesMut::new();
buf.extend_from_slice(&img[..]);
@@ -242,11 +267,11 @@ impl<'a> Basebackup<'a> {
fn add_pgcontrol_file(&mut self) -> anyhow::Result<()> {
let checkpoint_bytes = self
.timeline
.get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn)
.get_checkpoint(self.lsn)
.context("failed to get checkpoint bytes")?;
let pg_control_bytes = self
.timeline
.get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn)
.get_control_file(self.lsn)
.context("failed get control bytes")?;
let mut pg_control = ControlFileData::decode(&pg_control_bytes)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
@@ -267,7 +292,7 @@ impl<'a> Basebackup<'a> {
// add zenith.signal file
let mut zenith_signal = String::new();
if self.prev_record_lsn == Lsn(0) {
if self.lsn == self.timeline.get_ancestor_lsn() {
if self.lsn == self.timeline.tline.get_ancestor_lsn() {
write!(zenith_signal, "PREV LSN: none")?;
} else {
write!(zenith_signal, "PREV LSN: invalid")?;
@@ -291,7 +316,7 @@ impl<'a> Basebackup<'a> {
let wal_file_path = format!("pg_wal/{}", wal_file_name);
let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?;
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier);
assert!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
ensure!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
self.ar.append(&header, &wal_seg[..])?;
Ok(())
}
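
Several hunks above swap `assert!` for `ensure!`, so that a size mismatch is returned as an error to the caller instead of panicking the thread. The sketch below shows the same idea using only the standard library; it does not use `anyhow`, and `SizeMismatch`, `check_page`, and `append_pages` are hypothetical names. `BLCKSZ` is assumed to be the default PostgreSQL block size of 8192 bytes.

```rust
use std::fmt;

/// Assumed default PostgreSQL block size (stand-in for `pg_constants::BLCKSZ`).
const BLCKSZ: usize = 8192;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
struct SizeMismatch {
    expected: usize,
    actual: usize,
}

impl fmt::Display for SizeMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "expected {} bytes, got {}", self.expected, self.actual)
    }
}

impl std::error::Error for SizeMismatch {}

/// Returns an error instead of panicking when the page has the wrong size.
/// This is the role `ensure!` plays in the diff, written out by hand.
fn check_page(img: &[u8]) -> Result<(), SizeMismatch> {
    if img.len() != BLCKSZ {
        return Err(SizeMismatch {
            expected: BLCKSZ,
            actual: img.len(),
        });
    }
    Ok(())
}

/// Appends pages to `buf`, stopping at the first malformed page.
/// Nothing from the malformed page is written to `buf`.
fn append_pages(buf: &mut Vec<u8>, pages: &[Vec<u8>]) -> Result<(), SizeMismatch> {
    for img in pages {
        check_page(img)?;
        buf.extend_from_slice(img);
    }
    Ok(())
}
```

With this shape, one bad page fails a single request with a descriptive error rather than taking down the thread that was serving it.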


@@ -4,6 +4,7 @@
use anyhow::Result;
use clap::{App, Arg};
use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::page_cache;
use pageserver::virtual_file;
use std::path::PathBuf;
use zenith_utils::GIT_VERSION;
@@ -24,8 +25,9 @@ fn main() -> Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(10);
page_cache::init(100);
dump_layerfile_from_path(&path)?;
dump_layerfile_from_path(&path, true)?;
Ok(())
}


@@ -18,16 +18,18 @@ use daemonize::Daemonize;
use pageserver::{
config::{defaults::*, PageServerConf},
http, page_cache, page_service, remote_storage, tenant_mgr, thread_mgr,
http, page_cache, page_service,
remote_storage::{self, SyncStartupData},
repository::{Repository, TimelineSyncStatusUpdate},
tenant_mgr, thread_mgr,
thread_mgr::ThreadKind,
timelines, virtual_file, LOG_FILE_NAME,
};
use zenith_utils::http::endpoint;
use zenith_utils::postgres_backend;
use zenith_utils::shutdown::exit_now;
use zenith_utils::signals::{self, Signal};
fn main() -> Result<()> {
fn main() -> anyhow::Result<()> {
zenith_metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
@@ -113,7 +115,7 @@ fn main() -> Result<()> {
// We're initializing the repo, so there's no config file yet
DEFAULT_CONFIG_FILE
.parse::<toml_edit::Document>()
.expect("could not parse built-in config file")
.context("could not parse built-in config file")?
} else {
// Supplement the CLI arguments with the config file
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)
@@ -160,8 +162,7 @@ fn main() -> Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
page_cache::init(conf);
page_cache::init(conf.page_cache_size);
// Create repo and exit if init was requested
if init {
@@ -207,7 +208,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
// There shouldn't be any logging to stdin/stdout. Redirect it to the main log so
// that we will see any accidental manual fprintf's or backtraces.
let stdout = log_file.try_clone().unwrap();
let stdout = log_file
.try_clone()
.with_context(|| format!("Failed to clone log file '{:?}'", log_file))?;
let stderr = log_file;
let daemonize = Daemonize::new()
@@ -227,11 +230,47 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
}
let signals = signals::install_shutdown_handlers()?;
let sync_startup = remote_storage::start_local_timeline_sync(conf)
// Initialize repositories with locally available timelines.
// Timelines that are only partially available locally (remote storage has more data than this pageserver)
// are scheduled for download and added to the repository once download is completed.
let SyncStartupData {
remote_index,
local_timeline_init_statuses,
} = remote_storage::start_local_timeline_sync(conf)
.context("Failed to set up local files sync with external storage")?;
// Initialize tenant manager.
tenant_mgr::set_timeline_states(conf, sync_startup.initial_timeline_states);
for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses {
// initialize local tenant
let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index);
for (timeline_id, init_status) in local_timeline_init_statuses {
match init_status {
remote_storage::LocalTimelineInitStatus::LocallyComplete => {
debug!("timeline {} for tenant {} is locally complete, registering it in repository", tenant_id, timeline_id);
// Lets fail here loudly to be on the safe side.
// XXX: It may be a better api to actually distinguish between repository startup
// and processing of newly downloaded timelines.
repo.apply_timeline_remote_sync_status_update(
timeline_id,
TimelineSyncStatusUpdate::Downloaded,
)
.with_context(|| {
format!(
"Failed to bootstrap timeline {} for tenant {}",
timeline_id, tenant_id
)
})?
}
remote_storage::LocalTimelineInitStatus::NeedsSync => {
debug!(
"timeline {} for tenant {} needs sync, \
so skipped for adding into repository until sync is finished",
tenant_id, timeline_id
);
}
}
}
}
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
@@ -252,8 +291,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"http_endpoint_thread",
false,
move || {
let router = http::make_router(conf, auth_cloned);
let router = http::make_router(conf, auth_cloned, remote_index);
endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher())
},
)?;
@@ -265,6 +305,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"libpq endpoint thread",
false,
move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type),
)?;
@@ -282,38 +323,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"Got {}. Terminating gracefully in fast shutdown mode",
signal.name()
);
shutdown_pageserver();
pageserver::shutdown_pageserver();
unreachable!()
}
})
}
fn shutdown_pageserver() {
// Shut down the libpq endpoint thread. This prevents new connections from
// being accepted.
thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None);
// Shut down any page service threads.
postgres_backend::set_pgbackend_shutdown_requested();
thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None);
// Shut down all the tenants. This flushes everything to disk and kills
// the checkpoint and GC threads.
tenant_mgr::shutdown_all_tenants();
// Stop syncing with remote storage.
//
// FIXME: Does this wait for the sync thread to finish syncing what's queued up?
// Should it?
thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None);
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None);
// There should be nothing left, but let's be sure
thread_mgr::shutdown_threads(None, None, None);
info!("Shut down successfully completed");
std::process::exit(0);
}


@@ -30,8 +30,14 @@ pub mod defaults {
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
// This parameter actually determines L0 layer file size.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
pub const DEFAULT_CHECKPOINT_PERIOD: &str = "1 s";
// Target file size, when creating image and delta layers.
// This parameter determines L1 layer file size.
pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024;
pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s";
pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10;
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: &str = "100 s";
@@ -40,7 +46,7 @@ pub mod defaults {
pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 100;
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 10;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
@@ -57,7 +63,9 @@ pub mod defaults {
#listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#checkpoint_period = '{DEFAULT_CHECKPOINT_PERIOD}'
#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
#compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}'
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
@@ -90,8 +98,18 @@ pub struct PageServerConf {
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
// This parameter actually determines L0 layer file size.
pub checkpoint_distance: u64,
pub checkpoint_period: Duration,
// Target file size, when creating image and delta layers.
// This parameter determines L1 layer file size.
pub compaction_target_size: u64,
// How often to check if there's compaction work to be done.
pub compaction_period: Duration,
// Level0 delta layer threshold for compaction.
pub compaction_threshold: usize,
pub gc_horizon: u64,
pub gc_period: Duration,
@@ -145,7 +163,10 @@ struct PageServerConfigBuilder {
listen_http_addr: BuilderValue<String>,
checkpoint_distance: BuilderValue<u64>,
checkpoint_period: BuilderValue<Duration>,
compaction_target_size: BuilderValue<u64>,
compaction_period: BuilderValue<Duration>,
compaction_threshold: BuilderValue<usize>,
gc_horizon: BuilderValue<u64>,
gc_period: BuilderValue<Duration>,
@@ -179,8 +200,10 @@ impl Default for PageServerConfigBuilder {
listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()),
listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()),
checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE),
checkpoint_period: Set(humantime::parse_duration(DEFAULT_CHECKPOINT_PERIOD)
.expect("cannot parse default checkpoint period")),
compaction_target_size: Set(DEFAULT_COMPACTION_TARGET_SIZE),
compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD)
.expect("cannot parse default compaction period")),
compaction_threshold: Set(DEFAULT_COMPACTION_THRESHOLD),
gc_horizon: Set(DEFAULT_GC_HORIZON),
gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD)
.expect("cannot parse default gc period")),
@@ -216,8 +239,16 @@ impl PageServerConfigBuilder {
self.checkpoint_distance = BuilderValue::Set(checkpoint_distance)
}
pub fn checkpoint_period(&mut self, checkpoint_period: Duration) {
self.checkpoint_period = BuilderValue::Set(checkpoint_period)
pub fn compaction_target_size(&mut self, compaction_target_size: u64) {
self.compaction_target_size = BuilderValue::Set(compaction_target_size)
}
pub fn compaction_period(&mut self, compaction_period: Duration) {
self.compaction_period = BuilderValue::Set(compaction_period)
}
pub fn compaction_threshold(&mut self, compaction_threshold: usize) {
self.compaction_threshold = BuilderValue::Set(compaction_threshold)
}
pub fn gc_horizon(&mut self, gc_horizon: u64) {
@@ -286,9 +317,15 @@ impl PageServerConfigBuilder {
checkpoint_distance: self
.checkpoint_distance
.ok_or(anyhow::anyhow!("missing checkpoint_distance"))?,
checkpoint_period: self
.checkpoint_period
.ok_or(anyhow::anyhow!("missing checkpoint_period"))?,
compaction_target_size: self
.compaction_target_size
.ok_or(anyhow::anyhow!("missing compaction_target_size"))?,
compaction_period: self
.compaction_period
.ok_or(anyhow::anyhow!("missing compaction_period"))?,
compaction_threshold: self
.compaction_threshold
.ok_or(anyhow::anyhow!("missing compaction_threshold"))?,
gc_horizon: self
.gc_horizon
.ok_or(anyhow::anyhow!("missing gc_horizon"))?,
@@ -337,10 +374,10 @@ pub struct RemoteStorageConfig {
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored relish data into.
/// Specify a root folder to place all stored files into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all relishes into the root
/// of the S3 bucket from the config.
/// AWS S3 based storage, storing all files in the S3 bucket
/// specified by the config
AwsS3(S3Config),
}
@@ -425,7 +462,13 @@ impl PageServerConf {
"listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?),
"listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?),
"checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?),
"checkpoint_period" => builder.checkpoint_period(parse_toml_duration(key, item)?),
"compaction_target_size" => {
builder.compaction_target_size(parse_toml_u64(key, item)?)
}
"compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?),
"compaction_threshold" => {
builder.compaction_threshold(parse_toml_u64(key, item)? as usize)
}
"gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?),
"gc_period" => builder.gc_period(parse_toml_duration(key, item)?),
"wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?),
@@ -561,7 +604,9 @@ impl PageServerConf {
PageServerConf {
id: ZNodeId(0),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: Duration::from_secs(10),
compaction_target_size: 4 * 1024 * 1024,
compaction_period: Duration::from_secs(10),
compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
wait_lsn_timeout: Duration::from_secs(60),
@@ -631,7 +676,10 @@ listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
checkpoint_distance = 111 # in bytes
checkpoint_period = '111 s'
compaction_target_size = 111 # in bytes
compaction_period = '111 s'
compaction_threshold = 2
gc_period = '222 s'
gc_horizon = 222
@@ -668,7 +716,9 @@ id = 10
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: humantime::parse_duration(defaults::DEFAULT_CHECKPOINT_PERIOD)?,
compaction_target_size: defaults::DEFAULT_COMPACTION_TARGET_SIZE,
compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?,
compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?,
wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?,
@@ -712,7 +762,9 @@ id = 10
listen_pg_addr: "127.0.0.1:64000".to_string(),
listen_http_addr: "127.0.0.1:9898".to_string(),
checkpoint_distance: 111,
checkpoint_period: Duration::from_secs(111),
compaction_target_size: 111,
compaction_period: Duration::from_secs(111),
compaction_threshold: 2,
gc_horizon: 222,
gc_period: Duration::from_secs(222),
wait_lsn_timeout: Duration::from_secs(111),
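
The config hunks above thread three new compaction settings through a builder whose fields are `BuilderValue`s, with `ok_or` reporting a missing value at build time. A reduced sketch of that pattern is shown below. It is not the pageserver code: `CompactionConf`, `CompactionConfBuilder`, the `NotSet` variant, and the `String` error type are assumptions made for illustration, while the default values are copied from the `DEFAULT_COMPACTION_*` constants in the diff.

```rust
use std::time::Duration;

/// A value that is either explicitly set or still missing.
/// (`NotSet` is an assumed variant name for this sketch.)
#[derive(Debug, Clone, PartialEq, Eq)]
enum BuilderValue<T> {
    Set(T),
    NotSet,
}

impl<T> BuilderValue<T> {
    /// Converts to a `Result`, using `err` when the value was never set.
    fn ok_or<E>(self, err: E) -> Result<T, E> {
        match self {
            BuilderValue::Set(v) => Ok(v),
            BuilderValue::NotSet => Err(err),
        }
    }
}

/// Hypothetical reduced config holding only the compaction settings.
#[derive(Debug, PartialEq, Eq)]
struct CompactionConf {
    target_size: u64,
    period: Duration,
    threshold: usize,
}

struct CompactionConfBuilder {
    target_size: BuilderValue<u64>,
    period: BuilderValue<Duration>,
    threshold: BuilderValue<usize>,
}

impl CompactionConfBuilder {
    /// Builder with nothing set; `build` fails until all fields are provided.
    fn empty() -> Self {
        Self {
            target_size: BuilderValue::NotSet,
            period: BuilderValue::NotSet,
            threshold: BuilderValue::NotSet,
        }
    }

    /// Builder pre-filled with the defaults shown in the diff
    /// (128 MiB target size, 1 s period, threshold of 10).
    fn with_defaults() -> Self {
        Self {
            target_size: BuilderValue::Set(128 * 1024 * 1024),
            period: BuilderValue::Set(Duration::from_secs(1)),
            threshold: BuilderValue::Set(10),
        }
    }

    fn threshold(&mut self, threshold: usize) {
        self.threshold = BuilderValue::Set(threshold)
    }

    /// Fails with a message naming the first missing field.
    fn build(self) -> Result<CompactionConf, String> {
        Ok(CompactionConf {
            target_size: self
                .target_size
                .ok_or("missing compaction_target_size".to_string())?,
            period: self.period.ok_or("missing compaction_period".to_string())?,
            threshold: self
                .threshold
                .ok_or("missing compaction_threshold".to_string())?,
        })
    }
}
```

Starting from `with_defaults()` and overriding a single field mirrors how a config file supplements the built-in defaults, while `empty()` shows how a missing field surfaces as an error rather than a silently chosen value.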


@@ -1,122 +1,36 @@
use crate::timelines::TimelineInfo;
use anyhow::{anyhow, bail, Context};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use zenith_utils::{
lsn::Lsn,
zid::{HexZTenantId, HexZTimelineId, ZNodeId, ZTenantId, ZTimelineId},
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TimelineCreateRequest {
pub new_timeline_id: Option<HexZTimelineId>,
pub ancestor_timeline_id: Option<HexZTimelineId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub new_timeline_id: Option<ZTimelineId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<ZTimelineId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_start_lsn: Option<Lsn>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TenantCreateRequest {
pub new_tenant_id: Option<HexZTenantId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub new_tenant_id: Option<ZTenantId>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TimelineInfoResponse {
pub kind: String,
#[serde(with = "hex")]
timeline_id: ZTimelineId,
#[serde(with = "hex")]
tenant_id: ZTenantId,
disk_consistent_lsn: String,
last_record_lsn: Option<String>,
prev_record_lsn: Option<String>,
ancestor_timeline_id: Option<HexZTimelineId>,
ancestor_lsn: Option<String>,
current_logical_size: Option<usize>,
current_logical_size_non_incremental: Option<usize>,
}
impl From<TimelineInfo> for TimelineInfoResponse {
fn from(other: TimelineInfo) -> Self {
match other {
TimelineInfo::Local {
timeline_id,
tenant_id,
last_record_lsn,
prev_record_lsn,
ancestor_timeline_id,
ancestor_lsn,
disk_consistent_lsn,
current_logical_size,
current_logical_size_non_incremental,
} => TimelineInfoResponse {
kind: "Local".to_owned(),
timeline_id,
tenant_id,
disk_consistent_lsn: disk_consistent_lsn.to_string(),
last_record_lsn: Some(last_record_lsn.to_string()),
prev_record_lsn: Some(prev_record_lsn.to_string()),
ancestor_timeline_id: ancestor_timeline_id.map(HexZTimelineId::from),
ancestor_lsn: ancestor_lsn.map(|lsn| lsn.to_string()),
current_logical_size: Some(current_logical_size),
current_logical_size_non_incremental,
},
TimelineInfo::Remote {
timeline_id,
tenant_id,
disk_consistent_lsn,
} => TimelineInfoResponse {
kind: "Remote".to_owned(),
timeline_id,
tenant_id,
disk_consistent_lsn: disk_consistent_lsn.to_string(),
last_record_lsn: None,
prev_record_lsn: None,
ancestor_timeline_id: None,
ancestor_lsn: None,
current_logical_size: None,
current_logical_size_non_incremental: None,
},
}
}
}
impl TryFrom<TimelineInfoResponse> for TimelineInfo {
type Error = anyhow::Error;
fn try_from(other: TimelineInfoResponse) -> anyhow::Result<Self> {
let parse_lsn_hex_string = |lsn_string: String| {
lsn_string
.parse::<Lsn>()
.with_context(|| format!("Failed to parse Lsn as hex string from '{}'", lsn_string))
};
let disk_consistent_lsn = parse_lsn_hex_string(other.disk_consistent_lsn)?;
Ok(match other.kind.as_str() {
"Local" => TimelineInfo::Local {
timeline_id: other.timeline_id,
tenant_id: other.tenant_id,
last_record_lsn: other
.last_record_lsn
.ok_or(anyhow!("Local timeline should have last_record_lsn"))
.and_then(parse_lsn_hex_string)?,
prev_record_lsn: other
.prev_record_lsn
.ok_or(anyhow!("Local timeline should have prev_record_lsn"))
.and_then(parse_lsn_hex_string)?,
ancestor_timeline_id: other.ancestor_timeline_id.map(ZTimelineId::from),
ancestor_lsn: other.ancestor_lsn.map(parse_lsn_hex_string).transpose()?,
disk_consistent_lsn,
current_logical_size: other.current_logical_size.ok_or(anyhow!("No "))?,
current_logical_size_non_incremental: other.current_logical_size_non_incremental,
},
"Remote" => TimelineInfo::Remote {
timeline_id: other.timeline_id,
tenant_id: other.tenant_id,
disk_consistent_lsn,
},
unknown => bail!("Unknown timeline kind: {}", unknown),
})
}
}
#[serde(transparent)]
pub struct TenantCreateResponse(#[serde_as(as = "DisplayFromStr")] pub ZTenantId);
#[derive(Serialize)]
pub struct StatusResponse {


@@ -18,7 +18,7 @@ paths:
schema:
type: object
required:
- id
- id
properties:
id:
type: integer
@@ -122,6 +122,110 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/attach:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: timeline_id
in: path
required: true
schema:
type: string
format: hex
post:
description: Attach remote timeline
responses:
"200":
description: Timeline attaching scheduled
"400":
description: Error when no tenant id found in path or no timeline id
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"404":
description: Timeline not found
content:
application/json:
schema:
$ref: "#/components/schemas/NotFoundError"
"409":
description: Timeline download is already in progress
content:
application/json:
schema:
$ref: "#/components/schemas/ConflictError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/detach:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: timeline_id
in: path
required: true
schema:
type: string
format: hex
post:
description: Detach local timeline
responses:
"200":
description: Timeline detached
"400":
description: Error when no tenant id found in path or no timeline id
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/:
parameters:
- name: tenant_id
@@ -148,6 +252,7 @@ paths:
format: hex
ancestor_start_lsn:
type: string
format: hex
responses:
"201":
description: TimelineInfo
@@ -178,7 +283,7 @@ paths:
content:
application/json:
schema:
$ref: "#/components/schemas/AlreadyExistsError"
$ref: "#/components/schemas/ConflictError"
"500":
description: Generic operation error
content:
@@ -259,7 +364,7 @@ paths:
content:
application/json:
schema:
$ref: "#/components/schemas/AlreadyExistsError"
$ref: "#/components/schemas/ConflictError"
"500":
description: Generic operation error
content:
@@ -289,7 +394,6 @@ components:
required:
- timeline_id
- tenant_id
- disk_consistent_lsn
properties:
timeline_id:
type: string
@@ -297,17 +401,44 @@ components:
tenant_id:
type: string
format: hex
local:
$ref: "#/components/schemas/LocalTimelineInfo"
remote:
$ref: "#/components/schemas/RemoteTimelineInfo"
RemoteTimelineInfo:
type: object
required:
- awaits_download
properties:
awaits_download:
type: boolean
remote_consistent_lsn:
type: string
format: hex
LocalTimelineInfo:
type: object
required:
- last_record_lsn
- disk_consistent_lsn
- timeline_state
properties:
last_record_lsn:
type: string
prev_record_lsn:
format: hex
disk_consistent_lsn:
type: string
format: hex
timeline_state:
type: string
ancestor_timeline_id:
type: string
format: hex
ancestor_lsn:
type: string
disk_consistent_lsn:
format: hex
prev_record_lsn:
type: string
format: hex
current_logical_size:
type: integer
current_logical_size_non_incremental:
@@ -327,14 +458,21 @@ components:
properties:
msg:
type: string
AlreadyExistsError:
ForbiddenError:
type: object
required:
- msg
properties:
msg:
type: string
ForbiddenError:
NotFoundError:
type: object
required:
- msg
properties:
msg:
type: string
ConflictError:
type: object
required:
- msg


@@ -16,24 +16,29 @@ use zenith_utils::http::{
request::parse_request_param,
};
use zenith_utils::http::{RequestExt, RouterBuilder};
use zenith_utils::zid::{HexZTenantId, ZTimelineId};
use zenith_utils::zid::{ZTenantTimelineId, ZTimelineId};
use super::models::{
StatusResponse, TenantCreateRequest, TimelineCreateRequest, TimelineInfoResponse,
StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest,
};
use crate::repository::RepositoryTimeline;
use crate::timelines::TimelineInfo;
use crate::remote_storage::{schedule_timeline_download, RemoteIndex};
use crate::repository::Repository;
use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo};
use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId};
#[derive(Debug)]
struct State {
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
allowlist_routes: Vec<Uri>,
}
impl State {
fn new(conf: &'static PageServerConf, auth: Option<Arc<JwtAuth>>) -> Self {
fn new(
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
) -> Self {
let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml"]
.iter()
.map(|v| v.parse().unwrap())
@@ -42,6 +47,7 @@ impl State {
conf,
auth,
allowlist_routes,
remote_index,
}
}
}
@@ -62,10 +68,7 @@ fn get_config(request: &Request<Body>) -> &'static PageServerConf {
// healthcheck handler
async fn status_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let config = get_config(&request);
Ok(json_response(
StatusCode::OK,
StatusResponse { id: config.id },
)?)
json_response(StatusCode::OK, StatusResponse { id: config.id })
}
async fn timeline_create_handler(mut request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -88,7 +91,7 @@ async fn timeline_create_handler(mut request: Request<Body>) -> Result<Response<
.map_err(ApiError::from_err)??;
Ok(match new_timeline_info {
Some(info) => json_response(StatusCode::CREATED, TimelineInfoResponse::from(info))?,
Some(info) => json_response(StatusCode::CREATED, info)?,
None => json_response(StatusCode::CONFLICT, ())?,
})
}
@@ -97,16 +100,35 @@ async fn timeline_list_handler(request: Request<Body>) -> Result<Response<Body>,
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let response_data: Vec<TimelineInfoResponse> = tokio::task::spawn_blocking(move || {
let local_timeline_infos = tokio::task::spawn_blocking(move || {
let _enter = info_span!("timeline_list", tenant = %tenant_id).entered();
crate::timelines::get_timelines(tenant_id, include_non_incremental_logical_size)
crate::timelines::get_local_timelines(tenant_id, include_non_incremental_logical_size)
})
.await
.map_err(ApiError::from_err)??
.into_iter()
.map(TimelineInfoResponse::from)
.collect();
Ok(json_response(StatusCode::OK, response_data)?)
.map_err(ApiError::from_err)??;
let mut response_data = Vec::with_capacity(local_timeline_infos.len());
for (timeline_id, local_timeline_info) in local_timeline_infos {
response_data.push(TimelineInfo {
tenant_id,
timeline_id,
local: Some(local_timeline_info),
remote: get_state(&request)
.remote_index
.read()
.await
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.disk_consistent_lsn(),
awaits_download: remote_entry.get_awaits_download(),
}),
})
}
json_response(StatusCode::OK, response_data)
}
// Gate non incremental logical size calculation behind a flag
@@ -129,25 +151,60 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let response_data = tokio::task::spawn_blocking(move || {
let _enter =
info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let span = info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id);
let (local_timeline_info, span) = tokio::task::spawn_blocking(move || {
let entered = span.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
let include_non_incremental_logical_size =
get_include_non_incremental_logical_size(&request);
Ok::<_, anyhow::Error>(TimelineInfo::from_repo_timeline(
tenant_id,
repo.get_timeline(timeline_id)?,
include_non_incremental_logical_size,
))
let local_timeline = {
repo.get_timeline(timeline_id)
.as_ref()
.map(|timeline| {
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline,
include_non_incremental_logical_size,
)
})
.transpose()?
};
Ok::<_, anyhow::Error>((local_timeline, entered.exit()))
})
.await
.map_err(ApiError::from_err)?
.map(TimelineInfoResponse::from)?;
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
let remote_timeline_info = {
let remote_index_read = get_state(&request).remote_index.read().await;
remote_index_read
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.disk_consistent_lsn(),
awaits_download: remote_entry.get_awaits_download(),
})
};
let _enter = span.entered();
if local_timeline_info.is_none() && remote_timeline_info.is_none() {
return Err(ApiError::NotFound(
"Timeline is not found neither locally nor remotely".to_string(),
));
}
let timeline_info = TimelineInfo {
tenant_id,
timeline_id,
local: local_timeline_info,
remote: remote_timeline_info,
};
json_response(StatusCode::OK, timeline_info)
}
async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -155,32 +212,39 @@ async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let span = info_span!("timeline_attach_handler", tenant = %tenant_id, timeline = %timeline_id);
tokio::task::spawn_blocking(move || {
let _enter =
info_span!("timeline_attach_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
match repo.get_timeline(timeline_id)? {
RepositoryTimeline::Local { .. } => {
anyhow::bail!("Timeline with id {} is already local", timeline_id)
}
RepositoryTimeline::Remote {
id: _,
disk_consistent_lsn: _,
} => {
// FIXME (rodionov) get timeline already schedules timeline for download, and duplicate tasks can cause errors
// first should be fixed in https://github.com/zenithdb/zenith/issues/997
// TODO (rodionov) change timeline state to awaits download (encapsulate it somewhere in the repo)
// TODO (rodionov) can we safely request replication on the timeline before sync is completed? (can be implemented on top of the #997)
Ok(())
}
}
let span = tokio::task::spawn_blocking(move || {
let entered = span.entered();
if tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).is_ok() {
// TODO: maybe answer with 304 Not Modified here?
anyhow::bail!("Timeline is already present locally")
};
Ok(entered.exit())
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::ACCEPTED, ())?)
let mut remote_index_write = get_state(&request).remote_index.write().await;
let _enter = span.entered(); // entered guard cannot live across awaits (non Send)
let index_entry = remote_index_write
.timeline_entry_mut(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.ok_or_else(|| ApiError::NotFound("Unknown remote timeline".to_string()))?;
if index_entry.get_awaits_download() {
return Err(ApiError::Conflict(
"Timeline download is already in progress".to_string(),
));
}
index_entry.set_awaits_download(true);
schedule_timeline_download(tenant_id, timeline_id);
json_response(StatusCode::ACCEPTED, ())
}
async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -199,7 +263,7 @@ async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, ())?)
json_response(StatusCode::OK, ())
}
async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -213,7 +277,7 @@ async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, A
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
json_response(StatusCode::OK, response_data)
}
async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -221,19 +285,23 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
check_permission(&request, None)?;
let request_data: TenantCreateRequest = json_request(&mut request).await?;
let remote_index = get_state(&request).remote_index.clone();
let target_tenant_id = request_data
.new_tenant_id
.map(ZTenantId::from)
.unwrap_or_else(ZTenantId::generate);
let new_tenant_id = tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_create", tenant = ?request_data.new_tenant_id).entered();
tenant_mgr::create_tenant_repository(
get_config(&request),
request_data.new_tenant_id.map(ZTenantId::from),
)
let _enter = info_span!("tenant_create", tenant = ?target_tenant_id).entered();
tenant_mgr::create_tenant_repository(get_config(&request), target_tenant_id, remote_index)
})
.await
.map_err(ApiError::from_err)??;
Ok(match new_tenant_id {
Some(id) => json_response(StatusCode::CREATED, HexZTenantId::from(id))?,
Some(id) => json_response(StatusCode::CREATED, TenantCreateResponse(id))?,
None => json_response(StatusCode::CONFLICT, ())?,
})
}
@@ -248,6 +316,7 @@ async fn handler_404(_: Request<Body>) -> Result<Response<Body>, ApiError> {
pub fn make_router(
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
) -> RouterBuilder<hyper::Body, ApiError> {
let spec = include_bytes!("openapi_spec.yml");
let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc");
@@ -263,7 +332,7 @@ pub fn make_router(
}
router
.data(Arc::new(State::new(conf, auth)))
.data(Arc::new(State::new(conf, auth, remote_index)))
.get("/v1/status", status_handler)
.get("/v1/tenant", tenant_list_handler)
.post("/v1/tenant", tenant_create_handler)


@@ -11,14 +11,15 @@ use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use tracing::*;
use crate::relish::*;
use crate::repository::*;
use crate::pgdatadir_mapping::*;
use crate::reltag::{RelTag, SlruKind};
use crate::repository::Repository;
use crate::walingest::WalIngest;
use postgres_ffi::relfile_utils::*;
use postgres_ffi::waldecoder::*;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::Oid;
use postgres_ffi::{pg_constants, ControlFileData, DBState_DB_SHUTDOWNED};
use postgres_ffi::{Oid, TransactionId};
use zenith_utils::lsn::Lsn;
///
@@ -27,42 +28,47 @@ use zenith_utils::lsn::Lsn;
/// This is currently only used to import a cluster freshly created by initdb.
/// The code that deals with the checkpoint would not work right if the
/// cluster was not shut down cleanly.
pub fn import_timeline_from_postgres_datadir(
pub fn import_timeline_from_postgres_datadir<R: Repository>(
path: &Path,
writer: &dyn TimelineWriter,
tline: &mut DatadirTimeline<R>,
lsn: Lsn,
) -> Result<()> {
let mut pg_control: Option<ControlFileData> = None;
let mut modification = tline.begin_modification(lsn);
modification.init_empty()?;
// Scan 'global'
let mut relfiles: Vec<PathBuf> = Vec::new();
for direntry in fs::read_dir(path.join("global"))? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
Some("pg_control") => {
pg_control = Some(import_control_file(writer, lsn, &direntry.path())?);
pg_control = Some(import_control_file(&mut modification, &direntry.path())?);
}
Some("pg_filenode.map") => {
import_relmap_file(
&mut modification,
pg_constants::GLOBALTABLESPACE_OID,
0,
&direntry.path(),
)?;
}
Some("pg_filenode.map") => import_nonrel_file(
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::GLOBALTABLESPACE_OID,
dbnode: 0,
},
&direntry.path(),
)?,
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
writer,
lsn,
pg_constants::GLOBALTABLESPACE_OID,
0,
)?,
// Load any relation files into the page server (but only after the other files)
_ => relfiles.push(direntry.path()),
}
}
for relfile in relfiles {
import_relfile(
&mut modification,
&relfile,
pg_constants::GLOBALTABLESPACE_OID,
0,
)?;
}
// Scan 'base'. It contains database dirs, the database OID is the filename.
// E.g. 'base/12345', where 12345 is the database OID.
@@ -70,60 +76,62 @@ pub fn import_timeline_from_postgres_datadir(
let direntry = direntry?;
//skip all temporary files
if direntry.file_name().to_str().unwrap() == "pgsql_tmp" {
if direntry.file_name().to_string_lossy() == "pgsql_tmp" {
continue;
}
let dboid = direntry.file_name().to_str().unwrap().parse::<u32>()?;
let dboid = direntry.file_name().to_string_lossy().parse::<u32>()?;
let mut relfiles: Vec<PathBuf> = Vec::new();
for direntry in fs::read_dir(direntry.path())? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
Some("PG_VERSION") => continue,
Some("pg_filenode.map") => import_nonrel_file(
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::DEFAULTTABLESPACE_OID,
dbnode: dboid,
},
Some("PG_VERSION") => {
//modification.put_dbdir_creation(pg_constants::DEFAULTTABLESPACE_OID, dboid)?;
}
Some("pg_filenode.map") => import_relmap_file(
&mut modification,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
&direntry.path(),
)?,
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
writer,
lsn,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
)?,
_ => relfiles.push(direntry.path()),
}
}
for relfile in relfiles {
import_relfile(
&mut modification,
&relfile,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
)?;
}
}
for entry in fs::read_dir(path.join("pg_xact"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::Clog, &entry.path())?;
import_slru_file(&mut modification, SlruKind::Clog, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("members"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::MultiXactMembers, &entry.path())?;
import_slru_file(&mut modification, SlruKind::MultiXactMembers, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("offsets"))? {
let entry = entry?;
import_slru_file(writer, lsn, SlruKind::MultiXactOffsets, &entry.path())?;
import_slru_file(&mut modification, SlruKind::MultiXactOffsets, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_twophase"))? {
let entry = entry?;
let xid = u32::from_str_radix(entry.path().to_str().unwrap(), 16)?;
import_nonrel_file(writer, lsn, RelishTag::TwoPhase { xid }, &entry.path())?;
let xid = u32::from_str_radix(&entry.path().to_string_lossy(), 16)?;
import_twophase_file(&mut modification, xid, &entry.path())?;
}
// TODO: Scan pg_tblspc
// We're done importing all the data files.
writer.advance_last_record_lsn(lsn);
modification.commit()?;
// We expect the Postgres server to be shut down cleanly.
let pg_control = pg_control.context("pg_control file not found")?;
@@ -141,7 +149,7 @@ pub fn import_timeline_from_postgres_datadir(
// *after* the checkpoint record. And crucially, it initializes the 'prev_lsn'.
import_wal(
&path.join("pg_wal"),
writer,
tline,
Lsn(pg_control.checkPointCopy.redo),
lsn,
)?;
@@ -150,46 +158,53 @@ pub fn import_timeline_from_postgres_datadir(
}
// subroutine of import_timeline_from_postgres_datadir(), to load one relation file.
fn import_relfile(
fn import_relfile<R: Repository>(
modification: &mut DatadirModification<R>,
path: &Path,
timeline: &dyn TimelineWriter,
lsn: Lsn,
spcoid: Oid,
dboid: Oid,
) -> Result<()> {
) -> anyhow::Result<()> {
// Does it look like a relation file?
trace!("importing rel file {}", path.display());
let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap());
if let Err(e) = p {
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
return Err(e.into());
}
let (relnode, forknum, segno) = p.unwrap();
let (relnode, forknum, segno) = parse_relfilename(&path.file_name().unwrap().to_string_lossy())
.map_err(|e| {
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
e
})?;
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let len = file.metadata().unwrap().len();
ensure!(len % pg_constants::BLCKSZ as u64 == 0);
let nblocks = len / pg_constants::BLCKSZ as u64;
if segno != 0 {
todo!();
}
let rel = RelTag {
spcnode: spcoid,
dbnode: dboid,
relnode,
forknum,
};
modification.put_rel_creation(rel, nblocks as u32)?;
let mut blknum: u32 = segno * (1024 * 1024 * 1024 / pg_constants::BLCKSZ as u32);
loop {
let r = file.read_exact(&mut buf);
match r {
Ok(_) => {
let rel = RelTag {
spcnode: spcoid,
dbnode: dboid,
relnode,
forknum,
};
let tag = RelishTag::Relation(rel);
timeline.put_page_image(tag, blknum, lsn, Bytes::copy_from_slice(&buf))?;
modification.put_rel_page_image(rel, blknum, Bytes::copy_from_slice(&buf))?;
}
// TODO: UnexpectedEof is expected
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
ensure!(blknum == nblocks as u32, "unexpected EOF");
break;
}
_ => {
@@ -203,16 +218,28 @@ fn import_relfile(
Ok(())
}
///
/// Import a "non-blocky" file into the repository
///
/// This is used for small files like the control file, twophase files etc. that
/// are just slurped into the repository as one blob.
///
fn import_nonrel_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
tag: RelishTag,
/// Import a relmapper (pg_filenode.map) file into the repository
fn import_relmap_file<R: Repository>(
modification: &mut DatadirModification<R>,
spcnode: Oid,
dbnode: Oid,
path: &Path,
) -> Result<()> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing relmap file {}", path.display());
modification.put_relmap_file(spcnode, dbnode, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
/// Import a twophase state file (pg_twophase/<xid>) into the repository
fn import_twophase_file<R: Repository>(
modification: &mut DatadirModification<R>,
xid: TransactionId,
path: &Path,
) -> Result<()> {
let mut file = File::open(path)?;
@@ -222,7 +249,7 @@ fn import_nonrel_file(
trace!("importing non-rel file {}", path.display());
timeline.put_page_image(tag, 0, lsn, Bytes::copy_from_slice(&buffer[..]))?;
modification.put_twophase_file(xid, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
@@ -231,9 +258,8 @@ fn import_nonrel_file(
///
/// The control file is imported as is, but we also extract the checkpoint record
/// from it and store it separately.
fn import_control_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
fn import_control_file<R: Repository>(
modification: &mut DatadirModification<R>,
path: &Path,
) -> Result<ControlFileData> {
let mut file = File::open(path)?;
@@ -244,17 +270,12 @@ fn import_control_file(
trace!("importing control file {}", path.display());
// Import it as ControlFile
timeline.put_page_image(
RelishTag::ControlFile,
0,
lsn,
Bytes::copy_from_slice(&buffer[..]),
)?;
modification.put_control_file(Bytes::copy_from_slice(&buffer[..]))?;
// Extract the checkpoint record and import it separately.
let pg_control = ControlFileData::decode(&buffer)?;
let checkpoint_bytes = pg_control.checkPointCopy.encode();
timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, checkpoint_bytes)?;
modification.put_checkpoint(checkpoint_bytes)?;
Ok(pg_control)
}
@@ -262,28 +283,34 @@ fn import_control_file(
///
/// Import an SLRU segment file
///
fn import_slru_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
fn import_slru_file<R: Repository>(
modification: &mut DatadirModification<R>,
slru: SlruKind,
path: &Path,
) -> Result<()> {
// Does it look like an SLRU file?
trace!("importing slru file {}", path.display());
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let segno = u32::from_str_radix(path.file_name().unwrap().to_str().unwrap(), 16)?;
let segno = u32::from_str_radix(&path.file_name().unwrap().to_string_lossy(), 16)?;
trace!("importing slru file {}", path.display());
let len = file.metadata().unwrap().len();
ensure!(len % pg_constants::BLCKSZ as u64 == 0); // we assume SLRU block size is the same as BLCKSZ
let nblocks = len / pg_constants::BLCKSZ as u64;
ensure!(nblocks <= pg_constants::SLRU_PAGES_PER_SEGMENT as u64);
modification.put_slru_segment_creation(slru, segno, nblocks as u32)?;
let mut rpageno = 0;
loop {
let r = file.read_exact(&mut buf);
match r {
Ok(_) => {
timeline.put_page_image(
RelishTag::Slru { slru, segno },
modification.put_slru_page_image(
slru,
segno,
rpageno,
lsn,
Bytes::copy_from_slice(&buf),
)?;
}
@@ -292,7 +319,7 @@ fn import_slru_file(
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
ensure!(rpageno == nblocks as u32, "unexpected EOF");
break;
}
_ => {
@@ -301,8 +328,6 @@ fn import_slru_file(
},
};
rpageno += 1;
// TODO: Check that the file isn't unexpectedly large, not larger than SLRU_PAGES_PER_SEGMENT pages
}
Ok(())
@@ -310,9 +335,9 @@ fn import_slru_file(
/// Scan PostgreSQL WAL files in given directory and load all records between
/// 'startpoint' and 'endpoint' into the repository.
fn import_wal(
fn import_wal<R: Repository>(
walpath: &Path,
writer: &dyn TimelineWriter,
tline: &mut DatadirTimeline<R>,
startpoint: Lsn,
endpoint: Lsn,
) -> Result<()> {
@@ -322,7 +347,7 @@ fn import_wal(
let mut offset = startpoint.segment_offset(pg_constants::WAL_SEGMENT_SIZE);
let mut last_lsn = startpoint;
let mut walingest = WalIngest::new(writer.deref(), startpoint)?;
let mut walingest = WalIngest::new(tline, startpoint)?;
while last_lsn <= endpoint {
// FIXME: assume postgresql tli 1 for now
@@ -355,7 +380,7 @@ fn import_wal(
let mut nrecords = 0;
while last_lsn <= endpoint {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
walingest.ingest_record(writer, recdata, lsn)?;
walingest.ingest_record(tline, recdata, lsn)?;
last_lsn = lsn;
nrecords += 1;

pageserver/src/keyspace.rs (new file)

@@ -0,0 +1,131 @@
use crate::repository::{key_range_size, singleton_range, Key};
use postgres_ffi::pg_constants;
use std::ops::Range;
///
/// Represents a set of Keys, in a compact form.
///
#[derive(Clone, Debug)]
pub struct KeySpace {
/// Contiguous ranges of keys that belong to the key space. In key order,
/// and with no overlap.
pub ranges: Vec<Range<Key>>,
}
impl KeySpace {
///
/// Partition a key space into chunks of roughly 'target_size' bytes
/// in each partition.
///
pub fn partition(&self, target_size: u64) -> KeyPartitioning {
// Assume that each value is 8k in size.
let target_nblocks = (target_size / pg_constants::BLCKSZ as u64) as usize;
let mut parts = Vec::new();
let mut current_part = Vec::new();
let mut current_part_size: usize = 0;
for range in &self.ranges {
// If appending the next contiguous range in the keyspace to the current
// partition would cause it to be too large, start a new partition.
let this_size = key_range_size(range) as usize;
if current_part_size + this_size > target_nblocks && !current_part.is_empty() {
parts.push(KeySpace {
ranges: current_part,
});
current_part = Vec::new();
current_part_size = 0;
}
// If the next range is larger than 'target_size', split it into
// 'target_size' chunks.
let mut remain_size = this_size;
let mut start = range.start;
while remain_size > target_nblocks {
let next = start.add(target_nblocks as u32);
parts.push(KeySpace {
ranges: vec![start..next],
});
start = next;
remain_size -= target_nblocks
}
current_part.push(start..range.end);
current_part_size += remain_size;
}
// add last partition that wasn't full yet.
if !current_part.is_empty() {
parts.push(KeySpace {
ranges: current_part,
});
}
KeyPartitioning { parts }
}
}
///
/// Represents a partitioning of the key space.
///
/// The only kind of partitioning we do is to partition the key space into
/// partitions that are roughly equal in physical size (see KeySpace::partition).
/// But this data structure could represent any partitioning.
///
#[derive(Clone, Debug, Default)]
pub struct KeyPartitioning {
pub parts: Vec<KeySpace>,
}
impl KeyPartitioning {
pub fn new() -> Self {
KeyPartitioning { parts: Vec::new() }
}
}
///
/// A helper object, to collect a set of keys and key ranges into a KeySpace
/// object. This takes care of merging adjacent keys and key ranges into
/// contiguous ranges.
///
#[derive(Clone, Debug, Default)]
pub struct KeySpaceAccum {
accum: Option<Range<Key>>,
ranges: Vec<Range<Key>>,
}
impl KeySpaceAccum {
pub fn new() -> Self {
Self {
accum: None,
ranges: Vec::new(),
}
}
pub fn add_key(&mut self, key: Key) {
self.add_range(singleton_range(key))
}
pub fn add_range(&mut self, range: Range<Key>) {
match self.accum.as_mut() {
Some(accum) => {
if range.start == accum.end {
accum.end = range.end;
} else {
assert!(range.start > accum.end);
self.ranges.push(accum.clone());
*accum = range;
}
}
None => self.accum = Some(range),
}
}
pub fn to_keyspace(mut self) -> KeySpace {
if let Some(accum) = self.accum.take() {
self.ranges.push(accum);
}
KeySpace {
ranges: self.ranges,
}
}
}
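
The merging behaviour of `KeySpaceAccum` can be illustrated with a small standalone sketch. It assumes a simplified key type (`u64` instead of the pageserver's `Key` struct) and a hypothetical `to_ranges` method in place of `to_keyspace`; it is not the code from this change, only the same accumulation rule in isolation.

```rust
use std::ops::Range;

/// Simplified stand-in for the pageserver's `Key` type (an assumption made
/// for this sketch; the real key is a multi-field struct).
type Key = u64;

#[derive(Default)]
struct KeySpaceAccum {
    /// The range currently being extended, if any.
    accum: Option<Range<Key>>,
    /// Completed, non-adjacent ranges, in increasing order.
    ranges: Vec<Range<Key>>,
}

impl KeySpaceAccum {
    fn add_key(&mut self, key: Key) {
        self.add_range(key..key + 1)
    }

    fn add_range(&mut self, range: Range<Key>) {
        match self.accum.as_mut() {
            // Adjacent to the range being built: just extend it.
            Some(accum) if range.start == accum.end => accum.end = range.end,
            // A gap: finish the current range and start a new one.
            Some(accum) => {
                assert!(range.start > accum.end, "input must be sorted");
                self.ranges.push(accum.clone());
                *accum = range;
            }
            None => self.accum = Some(range),
        }
    }

    /// Hypothetical counterpart of `to_keyspace`, returning the raw ranges.
    fn to_ranges(mut self) -> Vec<Range<Key>> {
        if let Some(accum) = self.accum.take() {
            self.ranges.push(accum);
        }
        self.ranges
    }
}

/// Adds keys 1 and 2, the range 3..10 and key 20, then returns the result.
fn merged_example() -> Vec<Range<Key>> {
    let mut acc = KeySpaceAccum::default();
    acc.add_key(1);
    acc.add_key(2);
    acc.add_range(3..10);
    acc.add_key(20);
    acc.to_ranges()
}
```

Calling `merged_example()` yields two ranges, `1..10` and `20..21`: the first three additions are adjacent and collapse into one range, while key 20 starts a new one.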


@@ -1,40 +1,42 @@
# Overview
The on-disk format is based on immutable files. The page server receives a
stream of incoming WAL, parses the WAL records to determine which pages they
apply to, and accumulates the incoming changes in memory. Every now and then,
the accumulated changes are written out to new immutable files. This process is
called checkpointing. Old versions of on-disk files that are not needed by any
timeline are removed by GC process.
The main responsibility of the Page Server is to process the incoming WAL, and
reprocess it into a format that allows reasonably quick access to any page
version.
version. The page server slices the incoming WAL per relation and page, and
packages the sliced WAL into suitably-sized "layer files". The layer files
contain all the history of the database, back to some reasonable retention
period. This system replaces the base backups and the WAL archive used in a
traditional PostgreSQL installation. The layer files are immutable: they are not
modified in-place after creation. New layer files are created for new incoming
WAL, and old layer files are removed when they are no longer needed.
The on-disk format is based on immutable files. The page server receives a
stream of incoming WAL, parses the WAL records to determine which pages they
apply to, and accumulates the incoming changes in memory. Whenever enough WAL
has been accumulated in memory, it is written out to a new immutable file. That
process accumulates "L0 delta files" on disk. When enough L0 files have been
accumulated, they are merged and re-partitioned into L1 files, and old files
that are no longer needed are removed by Garbage Collection (GC).
The incoming WAL contains updates to arbitrary pages in the system. The
distribution depends on the workload: the updates could be totally random, or
there could be a long stream of updates to a single relation when data is bulk
loaded, for example, or something in between. The page server slices the
incoming WAL per relation and page, and packages the sliced WAL into
suitably-sized "layer files". The layer files contain all the history of the
database, back to some reasonable retention period. This system replaces the
base backups and the WAL archive used in a traditional PostgreSQL
installation. The layer files are immutable, they are not modified in-place
after creation. New layer files are created for new incoming WAL, and old layer
files are removed when they are no longer needed. We could also replace layer
files with new files that contain the same information, merging small files for
example, but that hasn't been implemented yet.
loaded, for example, or something in between.
Cloud Storage Page Server Safekeeper
L1 L0 Memory WAL
Cloud Storage Page Server Safekeeper
Local disk Memory WAL
|AAAA| |AAAA|AAAA| |AA
|BBBB| |BBBB|BBBB| |
|CCCC|CCCC| <---- |CCCC|CCCC|CCCC| <--- |CC <---- ADEBAABED
|DDDD|DDDD| |DDDD|DDDD| |DDD
|EEEE| |EEEE|EEEE|EEEE| |E
+----+ +----+----+
|AAAA| |AAAA|AAAA| +---+-----+ |
+----+ +----+----+ | | | |AA
|BBBB| |BBBB|BBBB| |BB | AA | |BB
+----+----+ +----+----+ |C | BB | |CC
|CCCC|CCCC| <---- |CCCC|CCCC| <--- |D | CC | <--- |DDD <---- ADEBAABED
+----+----+ +----+----+ | | DDD | |E
|DDDD|DDDD| |DDDD|DDDD| |E | | |
+----+----+ +----+----+ | | |
|EEEE| |EEEE|EEEE| +---+-----+
+----+ +----+----+
In this illustration, WAL is received as a stream from the Safekeeper, from the
right. It is immediately captured by the page server and stored quickly in
@@ -42,39 +44,29 @@ memory. The page server memory can be thought of as a quick "reorder buffer",
used to hold the incoming WAL and reorder it so that we keep the WAL records for
the same page and relation close to each other.
From the page server memory, whenever enough WAL has been accumulated for one
relation segment, it is moved to local disk, as a new layer file, and the memory
is released.
From the page server memory, whenever enough WAL has been accumulated, it is flushed
to disk into a new L0 layer file, and the memory is released.
When enough L0 files have been accumulated, they are merged together and sliced
per key-space, producing a new set of files where each file covers a narrower
key range but a larger LSN range.
From the local disk, the layers are further copied to Cloud Storage, for
long-term archival. After a layer has been copied to Cloud Storage, it can be
removed from local disk, although we currently keep everything locally for fast
access. If a layer is needed that isn't found locally, it is fetched from Cloud
Storage and stored in local disk.
# Terms used in layered repository
- Relish - one PostgreSQL relation or similarly treated file.
- Segment - one slice of a Relish that is stored in a LayeredTimeline.
- Layer - specific version of a relish Segment in a range of LSNs.
Storage and stored in local disk. L0 and L1 files are both uploaded to Cloud
Storage.
# Layer map
The LayerMap tracks what layers exist for all the relishes in a timeline.
LayerMap consists of two data structures:
- segs - All the layers keyed by segment tag
- open_layers - data structure that hold all open layers ordered by oldest_pending_lsn for quick access during checkpointing. oldest_pending_lsn is the LSN of the oldest page version stored in this layer.
All operations that update InMemory Layers should update both structures to keep them up-to-date.
- LayeredTimeline - implements Timeline interface.
All methods of LayeredTimeline are aware of its ancestors and return data taking them into account.
TODO: Are there any exceptions to this?
For example, timeline.list_rels(lsn) will return all segments that are visible in this timeline at the LSN,
including ones that were not modified in this timeline and thus don't have a layer in the timeline's LayerMap.
The LayerMap tracks what layers exist in a timeline.
Currently, the layer map is just a resizeable array (Vec). On a GetPage@LSN or
other read request, the layer map scans through the array to find the right layer
that contains the data for the requested page. The read-code in LayeredTimeline
is aware of the ancestor, and returns data from the ancestor timeline if it's
not found on the current timeline.
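
As a rough illustration of the linear scan described above, the following sketch keeps layer descriptors in a `Vec` and picks one for a given key and LSN. The `LayerDesc` type, the `find_layer` function, the `u64` key and LSN types, and the selection rule (the newest layer covering the key that starts at or before the requested LSN) are all assumptions made for this example, not the actual LayerMap code.

```rust
use std::ops::Range;

/// Hypothetical descriptor of one layer: a rectangle in key-LSN space.
struct LayerDesc {
    name: &'static str,
    key_range: Range<u64>,
    lsn_range: Range<u64>,
}

/// Scan the whole array and return the newest layer that covers `key` and
/// starts at or before `lsn`. Returns None if no layer qualifies, in which
/// case a caller would fall back to the ancestor timeline.
fn find_layer<'a>(layers: &'a [LayerDesc], key: u64, lsn: u64) -> Option<&'a LayerDesc> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        .max_by_key(|l| l.lsn_range.start)
}

/// Three made-up layers: one older layer over the whole key range, and two
/// newer layers that each cover half of it.
fn example_layers() -> Vec<LayerDesc> {
    vec![
        LayerDesc { name: "A", key_range: 0..100, lsn_range: 10..20 },
        LayerDesc { name: "B", key_range: 0..50, lsn_range: 20..30 },
        LayerDesc { name: "C", key_range: 50..100, lsn_range: 20..30 },
    ]
}
```

With these layers, a lookup for key 60 at LSN 25 finds layer "C", the same key at LSN 15 finds the older layer "A", and key 200 is not covered by any layer.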
# Different kinds of layers
@@ -92,11 +84,11 @@ To avoid OOM errors, InMemory layers can be spilled to disk into ephemeral file.
TODO: Clarify the difference between Closed, Historic and Frozen.
There are two kinds of OnDisk layers:
- ImageLayer represents an image or a snapshot of a 10 MB relish segment, at one particular LSN.
- DeltaLayer represents a collection of WAL records or page images in a range of LSNs, for one
relish segment.
Dropped segments are always represented on disk by DeltaLayer.
- ImageLayer represents a snapshot of all the keys in a particular range, at one
particular LSN. Any keys that are not present in the ImageLayer are known not
to exist at that LSN.
- DeltaLayer represents a collection of WAL records or page images in a range of
LSNs, for a range of keys.
# Layer life cycle
@@ -109,71 +101,71 @@ layer or a delta layer, it is a valid end bound. An image layer represents
snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
Every layer starts its life as an Open In-Memory layer. When the page server
receives the first WAL record for a segment, it creates a new In-Memory layer
for it, and puts it to the layer map. Later, the layer is old enough, its
contents are written to disk, as On-Disk layers. This process is called
"evicting" a layer.
receives the first WAL record for a timeline, it creates a new In-Memory layer
for it, and puts it to the layer map. Later, when the layer becomes full, its
contents are written to disk, as an on-disk layer.
Layer eviction is a two-step process: First, the layer is marked as closed, so
that it no longer accepts new WAL records, and the layer map is updated
accordingly. If a new WAL record for that segment arrives after this step, a new
Open layer is created to hold it. After this first step, the layer is a Closed
Flushing a layer is a two-step process: First, the layer is marked as closed, so
that it no longer accepts new WAL records, and a new in-memory layer is created
to hold any WAL after that point. After this first step, the layer is in the Closed
InMemory state. This first step is called "freezing" the layer.
In the second step, new Delta and Image layers are created, containing all the
data in the Frozen InMemory layer. When the new layers are ready, the original
frozen layer is replaced with the new layers in the layer map, and the original
frozen layer is dropped, releasing the memory.
In the second step, a new Delta layer is created, containing all the data from
the Frozen InMemory layer. When it has been created and flushed to disk, the
original frozen layer is replaced with the new layer in the layer map, and the
original frozen layer is dropped, releasing the memory.
# Layer files (On-disk layers)
The files are called "layer files". Each layer file corresponds
to one RELISH_SEG_SIZE slice of a PostgreSQL relation fork or
non-rel file in a range of LSNs. The layer files
for each timeline are stored in the timeline's subdirectory under
The files are called "layer files". Each layer file covers a range of keys, and
a range of LSNs (or a single LSN, in case of image layers). You can think of it
as a rectangle in the two-dimensional key-LSN space. The layer files for each
timeline are stored in the timeline's subdirectory under
.zenith/tenants/<tenantid>/timelines.
There are two kind of layer file: base images, and deltas. A base
image file contains a layer of a segment as it was at one LSN,
whereas a delta file contains modifications to a segment - mostly in
the form of WAL records - in a range of LSN
There are two kinds of layer files: images, and delta layers. An image file
contains a snapshot of all keys at a particular LSN, whereas a delta file
contains modifications to the keys in its key range - mostly in the form of WAL
records - in a range of LSNs.
base image file:
image file:
rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<start LSN>
000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568
start key end key LSN
The first parts define the key range that the layer covers. See
pgdatadir_mapping.rs for how the key space is used. The last part is the LSN.
delta file:
rel_<spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<start LSN>_<end LSN>
Delta files are named similarly, but they cover a range of LSNs:
For example:
000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051
start key end key start LSN end LSN
rel_1663_13990_2609_0_10_000000000169C348
rel_1663_13990_2609_0_10_000000000169C348_0000000001702000
A delta file contains all the key-values in the key-range that were updated in
the LSN range. If a key has not been modified, there is no trace of it in the
delta layer.
In addition to the relations, with "rel_*" prefix, we use the same
format for storing various smaller files from the PostgreSQL data
directory. They will use different suffixes and the naming scheme up
to the LSNs vary. The Zenith source code uses the term "relish" to
mean "a relation, or other file that's treated like a relation in the
storage" For example, a base image of a CLOG segment would be named
like this:
pg_xact_0000_0_00000000198B06B0
A delta layer file can cover a part of the overall key space, as in the previous
example, or the whole key range like this:
There is no difference in how the relation and non-relation files are
managed, except that the first part of file names is different.
Internally, the relations and non-relation files that are managed in
the versioned store are together called "relishes".
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000578C6B29-0000000057A50051
If a file has been dropped, the last layer file for it is created
with the _DROPPED suffix, e.g.
rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED
A file that covers the whole key range is called an L0 file (Level 0), while a
file that covers only part of the key range is called an L1 file. The "level" of
a file is not explicitly stored anywhere; you can only distinguish them by
looking at the key range that a file covers. The read-path doesn't need to
treat L0 and L1 files any differently.
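
Since the level is only implied by the key range, it can be derived from the file name alone. The sketch below assumes the naming scheme shown above (start key, `-`, end key, `__`, then the LSN part) and treats a file as L0 when the start key is all zeros and the end key is all `F`s. The `Level` enum and the `classify` function are hypothetical helpers for illustration, not part of the page server.

```rust
#[derive(Debug, PartialEq)]
enum Level {
    L0,
    L1,
}

/// Classify a layer file by the key range encoded in its name.
/// Returns None if the name does not look like "<start>-<end>__<lsn part>".
fn classify(file_name: &str) -> Option<Level> {
    // Split off the LSN part; it plays no role in the classification.
    let (keys, _lsns) = file_name.split_once("__")?;
    let (start, end) = keys.split_once('-')?;

    // Both keys must be non-empty hexadecimal strings.
    let is_hex = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_hexdigit());
    if !is_hex(start) || !is_hex(end) {
        return None;
    }

    // The whole key space runs from all zeros to all F's.
    let covers_everything =
        start.chars().all(|c| c == '0') && end.chars().all(|c| c == 'F');
    Some(if covers_everything { Level::L0 } else { Level::L1 })
}
```

Applied to the two delta file names shown above, the one spanning all zeros to all `F`s classifies as `Level::L0` and the one with a narrower key range as `Level::L1`.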
## Notation used in this document
FIXME: This is somewhat obsolete, the layer files cover a key-range rather than
a particular relation nowadays. However, the description on how you find a page
version, and how branching and GC works is still valid.
The full path of a delta file looks like this:
.zenith/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000


@@ -0,0 +1,139 @@
//!
//! Functions for reading and writing variable-sized "blobs".
//!
//! Each blob begins with a 4-byte length, followed by the actual data.
//!
use crate::layered_repository::block_io::{BlockCursor, BlockReader};
use crate::page_cache::PAGE_SZ;
use std::cmp::min;
use std::io::Error;
/// For reading
pub trait BlobCursor {
/// Read a blob into a new buffer.
fn read_blob(&mut self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
let mut buf = Vec::new();
self.read_blob_into_buf(offset, &mut buf)?;
Ok(buf)
}
/// Read blob into the given buffer. Any previous contents in the buffer
/// are overwritten.
fn read_blob_into_buf(
&mut self,
offset: u64,
dstbuf: &mut Vec<u8>,
) -> Result<(), std::io::Error>;
}
impl<R> BlobCursor for BlockCursor<R>
where
R: BlockReader,
{
fn read_blob_into_buf(
&mut self,
offset: u64,
dstbuf: &mut Vec<u8>,
) -> Result<(), std::io::Error> {
let mut blknum = (offset / PAGE_SZ as u64) as u32;
let mut off = (offset % PAGE_SZ as u64) as usize;
let mut buf = self.read_blk(blknum)?;
// read length
let mut len_buf = [0u8; 4];
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
} else {
len_buf.copy_from_slice(&buf[off..off + 4]);
off += 4;
}
let len = u32::from_ne_bytes(len_buf) as usize;
dstbuf.clear();
// Read the payload
let mut remain = len;
while remain > 0 {
let mut page_remain = PAGE_SZ - off;
if page_remain == 0 {
// continue on next page
blknum += 1;
buf = self.read_blk(blknum)?;
off = 0;
page_remain = PAGE_SZ;
}
let this_blk_len = min(remain, page_remain);
dstbuf.extend_from_slice(&buf[off..off + this_blk_len]);
remain -= this_blk_len;
off += this_blk_len;
}
Ok(())
}
}
///
/// Abstract trait for a data sink that you can write blobs to.
///
pub trait BlobWriter {
/// Write a blob of data. Returns the offset that it was written to,
/// which can be used to retrieve the data later.
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error>;
}
///
/// An implementation of BlobWriter to write blobs to anything that
/// implements std::io::Write.
///
pub struct WriteBlobWriter<W>
where
W: std::io::Write,
{
inner: W,
offset: u64,
}
impl<W> WriteBlobWriter<W>
where
W: std::io::Write,
{
pub fn new(inner: W, start_offset: u64) -> Self {
WriteBlobWriter {
inner,
offset: start_offset,
}
}
pub fn size(&self) -> u64 {
self.offset
}
/// Access the underlying Write object.
///
/// NOTE: WriteBlobWriter keeps track of the current write offset. If
/// you write something directly to the inner Write object, the internally
/// tracked 'offset' goes out of sync. So don't do that.
pub fn into_inner(self) -> W {
self.inner
}
}
impl<W> BlobWriter for WriteBlobWriter<W>
where
W: std::io::Write,
{
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
let offset = self.offset;
self.inner
.write_all(&((srcbuf.len()) as u32).to_ne_bytes())?;
self.inner.write_all(srcbuf)?;
self.offset += 4 + srcbuf.len() as u64;
Ok(offset)
}
}
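
The blob format itself (a 4-byte native-endian length followed by the payload) can be exercised without the page cache. The following sketch writes blobs into a plain `Vec<u8>` and reads them back by offset. The free functions `write_blob`, `read_blob` and `round_trip` are hypothetical simplifications: unlike `BlobCursor`, they read from one contiguous buffer and do not deal with page boundaries.

```rust
use std::io::Write;

/// Append one blob (4-byte native-endian length, then the data) and return
/// the offset it was written at. `offset` tracks the current write position.
fn write_blob<W: Write>(out: &mut W, offset: &mut u64, data: &[u8]) -> std::io::Result<u64> {
    let start = *offset;
    out.write_all(&(data.len() as u32).to_ne_bytes())?;
    out.write_all(data)?;
    *offset += 4 + data.len() as u64;
    Ok(start)
}

/// Read the blob stored at `offset` in a contiguous buffer.
/// Returns None if the header or payload would run past the end.
fn read_blob(file: &[u8], offset: u64) -> Option<Vec<u8>> {
    let off = offset as usize;
    let mut len_buf = [0u8; 4];
    len_buf.copy_from_slice(file.get(off..off.checked_add(4)?)?);
    let len = u32::from_ne_bytes(len_buf) as usize;
    let start = off + 4;
    file.get(start..start.checked_add(len)?)
        .map(|payload| payload.to_vec())
}

/// Write two blobs, then read both back using the returned offsets.
fn round_trip() -> (u64, u64, Vec<u8>, Vec<u8>) {
    let mut file = Vec::new();
    let mut offset = 0u64;
    let first = write_blob(&mut file, &mut offset, b"hello").unwrap();
    let second = write_blob(&mut file, &mut offset, b"page server").unwrap();
    let a = read_blob(&file, first).unwrap();
    let b = read_blob(&file, second).unwrap();
    (first, second, a, b)
}
```

The second blob starts at offset 9, because the first one occupies 4 bytes of length plus 5 bytes of payload.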


@@ -0,0 +1,218 @@
//!
//! Low-level Block-oriented I/O functions
//!
use crate::page_cache;
use crate::page_cache::{ReadBufResult, PAGE_SZ};
use bytes::Bytes;
use lazy_static::lazy_static;
use std::ops::{Deref, DerefMut};
use std::os::unix::fs::FileExt;
use std::sync::atomic::AtomicU64;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
///
/// There are currently two implementations: EphemeralFile, and FileBlockReader
/// below.
pub trait BlockReader {
type BlockLease: Deref<Target = [u8; PAGE_SZ]> + 'static;
///
/// Read a block. Returns a "lease" object that can be used to
/// access the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
///
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error>;
///
/// Create a new "cursor" for reading from this reader.
///
/// A cursor caches the last accessed page, allowing for faster
/// access if the same block is accessed repeatedly.
fn block_cursor(&self) -> BlockCursor<&Self>
where
Self: Sized,
{
BlockCursor::new(self)
}
}
impl<B> BlockReader for &B
where
B: BlockReader,
{
type BlockLease = B::BlockLease;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
(*self).read_blk(blknum)
}
}
///
/// A "cursor" for efficiently reading multiple pages from a BlockReader
///
/// A cursor caches the last accessed page, allowing for faster access if the
/// same block is accessed repeatedly.
///
/// You can access the last page with `*cursor`. 'read_blk' returns 'self', so
/// that in many cases you can use a BlockCursor as a drop-in replacement for
/// the underlying BlockReader. For example:
///
/// ```no_run
/// # use pageserver::layered_repository::block_io::{BlockReader, FileBlockReader};
/// # let reader: FileBlockReader<std::fs::File> = todo!();
/// let mut cursor = reader.block_cursor();
/// let buf = cursor.read_blk(1);
/// // do stuff with 'buf'
/// let buf = cursor.read_blk(2);
/// // do stuff with 'buf'
/// ```
///
pub struct BlockCursor<R>
where
R: BlockReader,
{
reader: R,
/// last accessed page
cache: Option<(u32, R::BlockLease)>,
}
impl<R> BlockCursor<R>
where
R: BlockReader,
{
pub fn new(reader: R) -> Self {
BlockCursor {
reader,
cache: None,
}
}
pub fn read_blk(&mut self, blknum: u32) -> Result<&Self, std::io::Error> {
// Fast return if this is the same block as before
if let Some((cached_blk, _buf)) = &self.cache {
if *cached_blk == blknum {
return Ok(self);
}
}
// Read the block from the underlying reader, and cache it
self.cache = None;
let buf = self.reader.read_blk(blknum)?;
self.cache = Some((blknum, buf));
Ok(self)
}
}
impl<R> Deref for BlockCursor<R>
where
R: BlockReader,
{
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &<Self as Deref>::Target {
&self.cache.as_ref().unwrap().1
}
}
lazy_static! {
static ref NEXT_ID: AtomicU64 = AtomicU64::new(1);
}
/// An adapter for reading a (virtual) file using the page cache.
///
/// The file is assumed to be immutable. This doesn't provide any functions
/// for modifying the file, nor for invalidating the cache if it is modified.
pub struct FileBlockReader<F> {
pub file: F,
/// Unique ID of this file, used as key in the page cache.
file_id: u64,
}
impl<F> FileBlockReader<F>
where
F: FileExt,
{
pub fn new(file: F) -> Self {
let file_id = NEXT_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
FileBlockReader { file_id, file }
}
/// Read a page from the underlying file into given buffer.
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
assert!(buf.len() == PAGE_SZ);
self.file.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
}
}
impl<F> BlockReader for FileBlockReader<F>
where
F: FileExt,
{
type BlockLease = page_cache::PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
// Look up the right page
let cache = page_cache::get();
loop {
match cache.read_immutable_buf(self.file_id, blknum) {
ReadBufResult::Found(guard) => break Ok(guard),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
write_guard.mark_valid();
// Swap for read lock
continue;
}
};
}
}
}
///
/// Trait for block-oriented output
///
pub trait BlockWriter {
///
/// Write a page to the underlying storage.
///
/// 'buf' must be of size PAGE_SZ. Returns the block number the page was
/// written to.
///
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error>;
}
///
/// A simple in-memory buffer of blocks.
///
pub struct BlockBuf {
pub blocks: Vec<Bytes>,
}
impl BlockWriter for BlockBuf {
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error> {
assert!(buf.len() == PAGE_SZ);
let blknum = self.blocks.len();
self.blocks.push(buf);
Ok(blknum as u32)
}
}
impl BlockBuf {
pub fn new() -> Self {
BlockBuf { blocks: Vec::new() }
}
pub fn size(&self) -> u64 {
(self.blocks.len() * PAGE_SZ) as u64
}
}
impl Default for BlockBuf {
fn default() -> Self {
Self::new()
}
}
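
The effect of the cursor's one-page cache can be seen with a small in-memory model. In this sketch, `MemReader` and `Cursor` are hypothetical stand-ins for a `BlockReader` implementation and `BlockCursor`; the reader counts how often it is actually asked for a block, so repeated reads of the same block show up as cache hits.

```rust
use std::cell::Cell;

const PAGE_SZ: usize = 8192;

/// Hypothetical in-memory block source that counts its reads.
struct MemReader {
    blocks: Vec<[u8; PAGE_SZ]>,
    reads: Cell<usize>,
}

impl MemReader {
    fn read_blk(&self, blknum: u32) -> std::io::Result<&[u8; PAGE_SZ]> {
        self.reads.set(self.reads.get() + 1);
        self.blocks.get(blknum as usize).ok_or_else(|| {
            std::io::Error::new(std::io::ErrorKind::UnexpectedEof, "block out of range")
        })
    }
}

/// Hypothetical cursor that remembers the last block it returned.
struct Cursor<'a> {
    reader: &'a MemReader,
    cache: Option<(u32, &'a [u8; PAGE_SZ])>,
}

impl<'a> Cursor<'a> {
    fn read_blk(&mut self, blknum: u32) -> std::io::Result<&'a [u8; PAGE_SZ]> {
        // Fast path: same block as last time, no call to the reader.
        if let Some((cached, buf)) = self.cache {
            if cached == blknum {
                return Ok(buf);
            }
        }
        // Slow path: fetch the block and remember it.
        let reader: &'a MemReader = self.reader;
        let buf = reader.read_blk(blknum)?;
        self.cache = Some((blknum, buf));
        Ok(buf)
    }
}

/// Read blocks 1, 1, 2, 2, 1 through a cursor and report how many reads
/// reached the underlying reader.
fn demo_reads() -> usize {
    let reader = MemReader {
        blocks: vec![[0u8; PAGE_SZ]; 3],
        reads: Cell::new(0),
    };
    let mut cursor = Cursor { reader: &reader, cache: None };
    for blknum in [1, 1, 2, 2, 1] {
        cursor.read_blk(blknum).expect("block exists");
    }
    reader.reads.get()
}
```

Of the five requests, only three reach the reader: the cache holds a single page, so going back to block 1 after block 2 is a miss again.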


@@ -0,0 +1,979 @@
//!
//! Simple on-disk B-tree implementation
//!
//! This is used as the index structure within image and delta layers
//!
//! Features:
//! - Fixed-width keys
//! - Fixed-width values (VALUE_SZ)
//! - The tree is created in a bulk operation. Insert/deletion after creation
//! is not supported
//! - page-oriented
//!
//! TODO:
//! - better errors (e.g. with thiserror?)
//! - maybe something like an Adaptive Radix Tree would be more efficient?
//! - the values stored by image and delta layers are offsets into the file,
//! and they are in monotonically increasing order. Prefix compression would
//! be very useful for them, too.
//! - An Iterator interface would be more convenient for the callers than the
//! 'visit' function
//!
use anyhow;
use byteorder::{ReadBytesExt, BE};
use bytes::{BufMut, Bytes, BytesMut};
use hex;
use std::cmp::Ordering;
use crate::layered_repository::block_io::{BlockReader, BlockWriter};
// The maximum size of a value stored in the B-tree. 5 bytes is enough currently.
pub const VALUE_SZ: usize = 5;
pub const MAX_VALUE: u64 = 0x007f_ffff_ffff;
#[allow(dead_code)]
pub const PAGE_SZ: usize = 8192;
#[derive(Clone, Copy, Debug)]
struct Value([u8; VALUE_SZ]);
impl Value {
fn from_slice(slice: &[u8]) -> Value {
let mut b = [0u8; VALUE_SZ];
b.copy_from_slice(slice);
Value(b)
}
fn from_u64(x: u64) -> Value {
assert!(x <= 0x007f_ffff_ffff);
Value([
(x >> 32) as u8,
(x >> 24) as u8,
(x >> 16) as u8,
(x >> 8) as u8,
x as u8,
])
}
fn from_blknum(x: u32) -> Value {
Value([
0x80,
(x >> 24) as u8,
(x >> 16) as u8,
(x >> 8) as u8,
x as u8,
])
}
#[allow(dead_code)]
fn is_offset(self) -> bool {
self.0[0] & 0x80 != 0
}
fn to_u64(self) -> u64 {
let b = &self.0;
(b[0] as u64) << 32
| (b[1] as u64) << 24
| (b[2] as u64) << 16
| (b[3] as u64) << 8
| b[4] as u64
}
fn to_blknum(self) -> u32 {
let b = &self.0;
assert!(b[0] == 0x80);
(b[1] as u32) << 24 | (b[2] as u32) << 16 | (b[3] as u32) << 8 | b[4] as u32
}
}
/// This is the on-disk representation.
struct OnDiskNode<'a, const L: usize> {
// Fixed-width fields
num_children: u16,
level: u8,
prefix_len: u8,
suffix_len: u8,
// Variable-length fields. These are stored on-disk after the fixed-width
// fields, in this order. In the in-memory representation, these point to
// the right parts in the page buffer.
prefix: &'a [u8],
keys: &'a [u8],
values: &'a [u8],
}
impl<'a, const L: usize> OnDiskNode<'a, L> {
///
/// Interpret a PAGE_SZ page as a node.
///
fn deparse(buf: &[u8]) -> OnDiskNode<L> {
let mut cursor = std::io::Cursor::new(buf);
let num_children = cursor.read_u16::<BE>().unwrap();
let level = cursor.read_u8().unwrap();
let prefix_len = cursor.read_u8().unwrap();
let suffix_len = cursor.read_u8().unwrap();
let mut off = cursor.position();
let prefix_off = off as usize;
off += prefix_len as u64;
let keys_off = off as usize;
let keys_len = num_children as usize * suffix_len as usize;
off += keys_len as u64;
let values_off = off as usize;
let values_len = num_children as usize * VALUE_SZ as usize;
//off += values_len as u64;
let prefix = &buf[prefix_off..prefix_off + prefix_len as usize];
let keys = &buf[keys_off..keys_off + keys_len];
let values = &buf[values_off..values_off + values_len];
OnDiskNode {
num_children,
level,
prefix_len,
suffix_len,
prefix,
keys,
values,
}
}
///
/// Read a value at 'idx'
///
fn value(&self, idx: usize) -> Value {
let value_off = idx * VALUE_SZ;
let value_slice = &self.values[value_off..value_off + VALUE_SZ];
Value::from_slice(value_slice)
}
fn binary_search(&self, search_key: &[u8; L], keybuf: &mut [u8]) -> Result<usize, usize> {
let mut size = self.num_children as usize;
let mut low = 0;
let mut high = size;
while low < high {
let mid = low + size / 2;
let key_off = mid as usize * self.suffix_len as usize;
let suffix = &self.keys[key_off..key_off + self.suffix_len as usize];
// Does this match?
keybuf[self.prefix_len as usize..].copy_from_slice(suffix);
let cmp = keybuf[..].cmp(search_key);
if cmp == Ordering::Less {
low = mid + 1;
} else if cmp == Ordering::Greater {
high = mid;
} else {
return Ok(mid);
}
size = high - low;
}
Err(low)
}
}
///
/// Public reader object, to search the tree.
///
pub struct DiskBtreeReader<R, const L: usize>
where
R: BlockReader,
{
start_blk: u32,
root_blk: u32,
reader: R,
}
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VisitDirection {
Forwards,
Backwards,
}
impl<R, const L: usize> DiskBtreeReader<R, L>
where
R: BlockReader,
{
pub fn new(start_blk: u32, root_blk: u32, reader: R) -> Self {
DiskBtreeReader {
start_blk,
root_blk,
reader,
}
}
///
/// Read the value for given key. Returns the value, or None if it doesn't exist.
///
pub fn get(&self, search_key: &[u8; L]) -> anyhow::Result<Option<u64>> {
let mut result: Option<u64> = None;
self.visit(search_key, VisitDirection::Forwards, |key, value| {
if key == search_key {
result = Some(value);
}
false
})?;
Ok(result)
}
///
/// Scan the tree, starting from 'search_key', in the given direction. 'visitor'
/// will be called for every key >= 'search_key' (or <= 'search_key', if scanning
/// backwards)
///
pub fn visit<V>(
&self,
search_key: &[u8; L],
dir: VisitDirection,
mut visitor: V,
) -> anyhow::Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
self.search_recurse(self.root_blk, search_key, dir, &mut visitor)
}
fn search_recurse<V>(
&self,
node_blknum: u32,
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> anyhow::Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
// Locate the node.
let blk = self.reader.read_blk(self.start_blk + node_blknum)?;
// Search all entries on this node
self.search_node(blk.as_ref(), search_key, dir, visitor)
}
fn search_node<V>(
&self,
node_buf: &[u8],
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> anyhow::Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
let node = OnDiskNode::deparse(node_buf);
let prefix_len = node.prefix_len as usize;
let suffix_len = node.suffix_len as usize;
assert!(node.num_children > 0);
let mut keybuf = Vec::new();
keybuf.extend(node.prefix);
keybuf.resize(prefix_len + suffix_len, 0);
if dir == VisitDirection::Forwards {
// Locate the first match
let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => idx,
Err(idx) => {
if node.level == 0 {
// Imagine that the node contains the following keys:
//
// 1
// 3 <-- idx
// 5
//
// If the search key is '2' and there is no exact match,
// the binary search would return the index of key
// '3'. That's cool, '3' is the first key to return.
idx
} else {
// This is an internal page, so each key represents a lower
// bound for what's in the child page. If there is no exact
// match, we have to return the *previous* entry.
//
// 1 <-- return this
// 3 <-- idx
// 5
idx.saturating_sub(1)
}
}
};
// idx points to the first match now. Keep going from there
let mut key_off = idx * suffix_len;
while idx < node.num_children as usize {
let suffix = &node.keys[key_off..key_off + suffix_len];
keybuf[prefix_len..].copy_from_slice(suffix);
let value = node.value(idx as usize);
#[allow(clippy::collapsible_if)]
if node.level == 0 {
// leaf
if !visitor(&keybuf, value.to_u64()) {
return Ok(false);
}
} else {
#[allow(clippy::collapsible_if)]
if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? {
return Ok(false);
}
}
idx += 1;
key_off += suffix_len;
}
} else {
let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) {
Ok(idx) => {
// Exact match. That's the first entry to return, and walk
// backwards from there. (The loop below starts from 'idx -
// 1', so add one here to compensate.)
idx + 1
}
Err(idx) => {
// No exact match. The binary search returned the index of the
// first key that's > search_key. Back off by one, and walk
// backwards from there. (The loop below starts from idx - 1,
// so we don't need to subtract one here)
idx
}
};
// idx points to the first match + 1 now. Keep going from there.
let mut key_off = idx * suffix_len;
while idx > 0 {
idx -= 1;
key_off -= suffix_len;
let suffix = &node.keys[key_off..key_off + suffix_len];
keybuf[prefix_len..].copy_from_slice(suffix);
let value = node.value(idx as usize);
#[allow(clippy::collapsible_if)]
if node.level == 0 {
// leaf
if !visitor(&keybuf, value.to_u64()) {
return Ok(false);
}
} else {
#[allow(clippy::collapsible_if)]
if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? {
return Ok(false);
}
}
if idx == 0 {
break;
}
}
}
Ok(true)
}
#[allow(dead_code)]
pub fn dump(&self) -> anyhow::Result<()> {
self.dump_recurse(self.root_blk, &[], 0)
}
fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> anyhow::Result<()> {
let blk = self.reader.read_blk(self.start_blk + blknum)?;
let buf: &[u8] = blk.as_ref();
let node = OnDiskNode::<L>::deparse(buf);
print!("{:indent$}", "", indent = depth * 2);
println!(
"blk #{}: path {}: prefix {}, suffix_len {}",
blknum,
hex::encode(path),
hex::encode(node.prefix),
node.suffix_len
);
let mut idx = 0;
let mut key_off = 0;
while idx < node.num_children {
let key = &node.keys[key_off..key_off + node.suffix_len as usize];
let val = node.value(idx as usize);
print!("{:indent$}", "", indent = depth * 2 + 2);
println!("{}: {}", hex::encode(key), hex::encode(val.0));
if node.level > 0 {
let child_path = [path, node.prefix].concat();
self.dump_recurse(val.to_blknum(), &child_path, depth + 1)?;
}
idx += 1;
key_off += node.suffix_len as usize;
}
Ok(())
}
}
///
/// Public builder object, for creating a new tree.
///
/// Usage: Create a builder object by calling 'new', load all the data into the
/// tree by calling 'append' for each key-value pair, and then call 'finish'
///
/// 'L' is the key length in bytes
pub struct DiskBtreeBuilder<W, const L: usize>
where
W: BlockWriter,
{
writer: W,
///
/// stack[0] is the current root page, stack.last() is the leaf.
///
stack: Vec<BuildNode<L>>,
/// Last key that was appended to the tree. Used to sanity check that append
/// is called in increasing key order.
last_key: Option<[u8; L]>,
}
impl<W, const L: usize> DiskBtreeBuilder<W, L>
where
W: BlockWriter,
{
pub fn new(writer: W) -> Self {
DiskBtreeBuilder {
writer,
last_key: None,
stack: vec![BuildNode::new(0)],
}
}
pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<(), anyhow::Error> {
assert!(value <= MAX_VALUE);
if let Some(last_key) = &self.last_key {
assert!(key > last_key, "unsorted input");
}
self.last_key = Some(*key);
Ok(self.append_internal(key, Value::from_u64(value))?)
}
fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<(), std::io::Error> {
// Try to append to the current leaf buffer
let last = self.stack.last_mut().unwrap();
let level = last.level;
if last.push(key, value) {
return Ok(());
}
// It did not fit. Try to compress, and if it succeeds to make some room
// on the node, try appending to it again.
#[allow(clippy::collapsible_if)]
if last.compress() {
if last.push(key, value) {
return Ok(());
}
}
// Could not append to the current leaf. Flush it and create a new one.
self.flush_node()?;
// Replace the node we flushed with an empty one and append the new
// key to it.
let mut last = BuildNode::new(level);
if !last.push(key, value) {
panic!("could not push to new leaf node");
}
self.stack.push(last);
Ok(())
}
fn flush_node(&mut self) -> Result<(), std::io::Error> {
let last = self.stack.pop().unwrap();
let buf = last.pack();
let downlink_key = last.first_key();
let downlink_ptr = self.writer.write_blk(buf)?;
// Append the downlink to the parent
if self.stack.is_empty() {
self.stack.push(BuildNode::new(last.level + 1));
}
self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))?;
Ok(())
}
///
/// Flushes everything to disk, and returns the block number of the root page.
/// The caller must store the root block number "out-of-band", and pass it
/// to DiskBtreeReader::new() in order to read the tree again.
/// (In the image and delta layers, it is stored in the beginning of the file,
/// in the summary header)
///
pub fn finish(mut self) -> Result<(u32, W), std::io::Error> {
// flush all levels, except the root.
while self.stack.len() > 1 {
self.flush_node()?;
}
let root = self.stack.first().unwrap();
let buf = root.pack();
let root_blknum = self.writer.write_blk(buf)?;
Ok((root_blknum, self.writer))
}
pub fn borrow_writer(&self) -> &W {
&self.writer
}
}
///
/// BuildNode represents an incomplete page that we are appending to.
///
#[derive(Clone, Debug)]
struct BuildNode<const L: usize> {
num_children: u16,
level: u8,
prefix: Vec<u8>,
suffix_len: usize,
keys: Vec<u8>,
values: Vec<u8>,
size: usize, // physical size of this node, if it was written to disk like this
}
const NODE_SIZE: usize = PAGE_SZ;
const NODE_HDR_SIZE: usize = 2 + 1 + 1 + 1;
impl<const L: usize> BuildNode<L> {
fn new(level: u8) -> Self {
BuildNode {
num_children: 0,
level,
prefix: Vec::new(),
suffix_len: 0,
keys: Vec::new(),
values: Vec::new(),
size: NODE_HDR_SIZE,
}
}
/// Try to append a key-value pair to this node. Returns 'true' on
/// success, 'false' if the page was full or the key was
/// incompatible with the prefix of the existing keys.
fn push(&mut self, key: &[u8; L], value: Value) -> bool {
// If we have already performed prefix-compression on the page,
// check that the incoming key has the same prefix.
if self.num_children > 0 {
// does the prefix allow it?
if !key.starts_with(&self.prefix) {
return false;
}
} else {
self.suffix_len = key.len();
}
// Is the node too full?
if self.size + self.suffix_len + VALUE_SZ >= NODE_SIZE {
return false;
}
// All clear
self.num_children += 1;
self.keys.extend(&key[self.prefix.len()..]);
self.values.extend(value.0);
assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize);
assert!(self.values.len() == self.num_children as usize * VALUE_SZ);
self.size += self.suffix_len + VALUE_SZ;
true
}
///
/// Perform prefix-compression.
///
/// Returns 'true' on success, 'false' if no compression was possible.
///
fn compress(&mut self) -> bool {
let first_suffix = self.first_suffix();
let last_suffix = self.last_suffix();
// Find the common prefix among all keys
let mut prefix_len = 0;
while prefix_len < self.suffix_len {
if first_suffix[prefix_len] != last_suffix[prefix_len] {
break;
}
prefix_len += 1;
}
if prefix_len == 0 {
return false;
}
// Can compress. Rewrite the keys without the common prefix.
self.prefix.extend(&self.keys[..prefix_len]);
let mut new_keys = Vec::new();
let mut key_off = 0;
while key_off < self.keys.len() {
let next_key_off = key_off + self.suffix_len;
new_keys.extend(&self.keys[key_off + prefix_len..next_key_off]);
key_off = next_key_off;
}
self.keys = new_keys;
self.suffix_len -= prefix_len;
self.size -= prefix_len * self.num_children as usize;
self.size += prefix_len;
assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize);
assert!(self.values.len() == self.num_children as usize * VALUE_SZ);
true
}
///
/// Serialize the node to on-disk format.
///
fn pack(&self) -> Bytes {
assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize);
assert!(self.values.len() == self.num_children as usize * VALUE_SZ);
assert!(self.num_children > 0);
let mut buf = BytesMut::new();
buf.put_u16(self.num_children);
buf.put_u8(self.level);
buf.put_u8(self.prefix.len() as u8);
buf.put_u8(self.suffix_len as u8);
buf.put(&self.prefix[..]);
buf.put(&self.keys[..]);
buf.put(&self.values[..]);
assert!(buf.len() == self.size);
assert!(buf.len() <= PAGE_SZ);
buf.resize(PAGE_SZ, 0);
buf.freeze()
}
fn first_suffix(&self) -> &[u8] {
&self.keys[..self.suffix_len]
}
fn last_suffix(&self) -> &[u8] {
&self.keys[self.keys.len() - self.suffix_len..]
}
/// Return the full first key of the page, including the prefix
fn first_key(&self) -> [u8; L] {
let mut key = [0u8; L];
key[..self.prefix.len()].copy_from_slice(&self.prefix);
key[self.prefix.len()..].copy_from_slice(self.first_suffix());
key
}
}
#[cfg(test)]
mod tests {
use super::*;
use rand::Rng;
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
#[derive(Clone, Default)]
struct TestDisk {
blocks: Vec<Bytes>,
}
impl TestDisk {
fn new() -> Self {
Self::default()
}
}
impl BlockReader for TestDisk {
type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
let mut buf = [0u8; PAGE_SZ];
buf.copy_from_slice(&self.blocks[blknum as usize]);
Ok(std::rc::Rc::new(buf))
}
}
impl BlockWriter for &mut TestDisk {
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error> {
let blknum = self.blocks.len();
self.blocks.push(buf);
Ok(blknum as u32)
}
}
#[test]
fn basic() -> anyhow::Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk);
let all_keys: Vec<&[u8; 6]> = vec![
b"xaaaaa", b"xaaaba", b"xaaaca", b"xabaaa", b"xababa", b"xabaca", b"xabada", b"xabadb",
];
let all_data: Vec<(&[u8; 6], u64)> = all_keys
.iter()
.enumerate()
.map(|(idx, key)| (*key, idx as u64))
.collect();
for (key, val) in all_data.iter() {
writer.append(key, *val)?;
}
let (root_offset, _writer) = writer.finish()?;
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump()?;
// Test the `get` function on all the keys.
for (key, val) in all_data.iter() {
assert_eq!(reader.get(key)?, Some(*val));
}
// And on some keys that don't exist
assert_eq!(reader.get(b"aaaaaa")?, None);
assert_eq!(reader.get(b"zzzzzz")?, None);
assert_eq!(reader.get(b"xaaabx")?, None);
// Test search with `visit` function
let search_key = b"xabaaa";
let expected: Vec<(Vec<u8>, u64)> = all_data
.iter()
.filter(|(key, _value)| key[..] >= search_key[..])
.map(|(key, value)| (key.to_vec(), *value))
.collect();
let mut data = Vec::new();
reader.visit(search_key, VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
assert_eq!(data, expected);
// Test a backwards scan
let mut expected: Vec<(Vec<u8>, u64)> = all_data
.iter()
.filter(|(key, _value)| key[..] <= search_key[..])
.map(|(key, value)| (key.to_vec(), *value))
.collect();
expected.reverse();
let mut data = Vec::new();
reader.visit(search_key, VisitDirection::Backwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
assert_eq!(data, expected);
// Backward scan where nothing matches
reader.visit(b"aaaaaa", VisitDirection::Backwards, |key, value| {
panic!("found unexpected key {}: {}", hex::encode(key), value);
})?;
// Full scan
let expected: Vec<(Vec<u8>, u64)> = all_data
.iter()
.map(|(key, value)| (key.to_vec(), *value))
.collect();
let mut data = Vec::new();
reader.visit(&[0u8; 6], VisitDirection::Forwards, |key, value| {
data.push((key.to_vec(), value));
true
})?;
assert_eq!(data, expected);
Ok(())
}
#[test]
fn lots_of_keys() -> anyhow::Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk);
const NUM_KEYS: u64 = 1000;
let mut all_data: BTreeMap<u64, u64> = BTreeMap::new();
for idx in 0..NUM_KEYS {
let key_int: u64 = 1 + idx * 2;
let key = u64::to_be_bytes(key_int);
writer.append(&key, idx)?;
all_data.insert(key_int, idx);
}
let (root_offset, _writer) = writer.finish()?;
let reader = DiskBtreeReader::new(0, root_offset, disk);
reader.dump()?;
use std::sync::Mutex;
let result = Mutex::new(Vec::new());
let limit: AtomicUsize = AtomicUsize::new(10);
let take_ten = |key: &[u8], value: u64| {
let mut keybuf = [0u8; 8];
keybuf.copy_from_slice(key);
let key_int = u64::from_be_bytes(keybuf);
let mut result = result.lock().unwrap();
result.push((key_int, value));
// keep going until we have 10 matches
result.len() < limit.load(Ordering::Relaxed)
};
for search_key_int in 0..(NUM_KEYS * 2 + 10) {
let search_key = u64::to_be_bytes(search_key_int);
assert_eq!(
reader.get(&search_key)?,
all_data.get(&search_key_int).cloned()
);
// Test a forward scan starting with this key
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Forwards, take_ten)?;
let expected = all_data
.range(search_key_int..)
.take(10)
.map(|(&key, &val)| (key, val))
.collect::<Vec<(u64, u64)>>();
assert_eq!(*result.lock().unwrap(), expected);
// And a backwards scan
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Backwards, take_ten)?;
let expected = all_data
.range(..=search_key_int)
.rev()
.take(10)
.map(|(&key, &val)| (key, val))
.collect::<Vec<(u64, u64)>>();
assert_eq!(*result.lock().unwrap(), expected);
}
// full scan
let search_key = u64::to_be_bytes(0);
limit.store(usize::MAX, Ordering::Relaxed);
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Forwards, take_ten)?;
let expected = all_data
.iter()
.map(|(&key, &val)| (key, val))
.collect::<Vec<(u64, u64)>>();
assert_eq!(*result.lock().unwrap(), expected);
// full scan, backwards
let search_key = u64::to_be_bytes(u64::MAX);
limit.store(usize::MAX, Ordering::Relaxed);
result.lock().unwrap().clear();
reader.visit(&search_key, VisitDirection::Backwards, take_ten)?;
let expected = all_data
.iter()
.rev()
.map(|(&key, &val)| (key, val))
.collect::<Vec<(u64, u64)>>();
assert_eq!(*result.lock().unwrap(), expected);
Ok(())
}
#[test]
fn random_data() -> anyhow::Result<()> {
// Generate random keys with exponential distribution, to
// exercise the prefix compression
const NUM_KEYS: usize = 100000;
let mut all_data: BTreeMap<u128, u64> = BTreeMap::new();
for idx in 0..NUM_KEYS {
let u: f64 = rand::thread_rng().gen_range(0.0..1.0);
let t = -(f64::ln(u));
let key_int = (t * 1000000.0) as u128;
all_data.insert(key_int as u128, idx as u64);
}
// Build a tree from it
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 16>::new(&mut disk);
for (&key, &val) in all_data.iter() {
writer.append(&u128::to_be_bytes(key), val)?;
}
let (root_offset, _writer) = writer.finish()?;
let reader = DiskBtreeReader::new(0, root_offset, disk);
// Test get() operation on all the keys
for (&key, &val) in all_data.iter() {
let search_key = u128::to_be_bytes(key);
assert_eq!(reader.get(&search_key)?, Some(val));
}
// Test get() operations on random keys, most of which will not exist
for _ in 0..100000 {
let key_int = rand::thread_rng().gen::<u128>();
let search_key = u128::to_be_bytes(key_int);
assert!(reader.get(&search_key)? == all_data.get(&key_int).cloned());
}
// Test boundary cases
assert!(reader.get(&u128::to_be_bytes(u128::MIN))? == all_data.get(&u128::MIN).cloned());
assert!(reader.get(&u128::to_be_bytes(u128::MAX))? == all_data.get(&u128::MAX).cloned());
Ok(())
}
#[test]
#[should_panic(expected = "unsorted input")]
fn unsorted_input() {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 2>::new(&mut disk);
let _ = writer.append(b"ba", 1);
let _ = writer.append(b"bb", 2);
let _ = writer.append(b"aa", 3);
}
///
/// This test contains a particular data set, see disk_btree_test_data.rs
///
#[test]
fn particular_data() -> anyhow::Result<()> {
// Build a tree from it
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk);
for (key, val) in disk_btree_test_data::TEST_DATA {
writer.append(&key, val)?;
}
let (root_offset, writer) = writer.finish()?;
println!("SIZE: {} blocks", writer.blocks.len());
let reader = DiskBtreeReader::new(0, root_offset, disk);
// Test get() operation on all the keys
for (key, val) in disk_btree_test_data::TEST_DATA {
assert_eq!(reader.get(&key)?, Some(val));
}
// Test full scan
let mut count = 0;
reader.visit(&[0u8; 26], VisitDirection::Forwards, |_key, _value| {
count += 1;
true
})?;
assert_eq!(count, disk_btree_test_data::TEST_DATA.len());
reader.dump()?;
Ok(())
}
}
#[cfg(test)]
#[path = "disk_btree_test_data.rs"]
mod disk_btree_test_data;
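
The prefix compression done by `BuildNode::compress` above relies on the keys arriving in sorted order: the bytes shared by the first and last key of a page are shared by every key in between. The following is a minimal standalone sketch of that idea, not the code from this change. The names `common_prefix_len` and `compress_keys` are hypothetical, and keys are modelled as `Vec<u8>` rather than fixed-size arrays.

```rust
/// Length of the prefix shared by every key in a sorted slice.
/// Because the keys are sorted, it is enough to compare the first
/// and the last key: whatever they share, all keys between them share.
/// (Hypothetical helper, for illustration only.)
fn common_prefix_len(sorted_keys: &[Vec<u8>]) -> usize {
    let (first, last) = match (sorted_keys.first(), sorted_keys.last()) {
        (Some(first), Some(last)) => (first, last),
        _ => return 0, // empty input: nothing to share
    };
    first
        .iter()
        .zip(last.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Split sorted keys into one shared prefix plus per-key suffixes,
/// which is the shape a prefix-compressed page stores.
fn compress_keys(sorted_keys: &[Vec<u8>]) -> (Vec<u8>, Vec<Vec<u8>>) {
    let n = common_prefix_len(sorted_keys);
    let prefix = sorted_keys
        .first()
        .map(|k| k[..n].to_vec())
        .unwrap_or_default();
    let suffixes = sorted_keys.iter().map(|k| k[n..].to_vec()).collect();
    (prefix, suffixes)
}
```

With the first three keys of the `basic` test (`xaaaaa`, `xaaaba`, `xaaaca`) this yields the prefix `xaaa` and two-byte suffixes, so each key costs two bytes on the page instead of six, at the price of storing the prefix once.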

File diff suppressed because it is too large

@@ -2,6 +2,8 @@
//! used to keep in-memory layers spilled on disk.
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::BlobWriter;
use crate::layered_repository::block_io::BlockReader;
use crate::page_cache;
use crate::page_cache::PAGE_SZ;
use crate::page_cache::{ReadBufResult, WriteBufResult};
@@ -10,7 +12,7 @@ use lazy_static::lazy_static;
use std::cmp::min;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{Error, ErrorKind, Seek, SeekFrom, Write};
use std::io::{Error, ErrorKind};
use std::ops::DerefMut;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
@@ -41,7 +43,7 @@ pub struct EphemeralFile {
_timelineid: ZTimelineId,
file: Arc<VirtualFile>,
pos: u64,
size: u64,
}
impl EphemeralFile {
@@ -70,11 +72,11 @@ impl EphemeralFile {
_tenantid: tenantid,
_timelineid: timelineid,
file: file_rc,
pos: 0,
size: 0,
})
}
pub fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> {
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
@@ -93,6 +95,26 @@ impl EphemeralFile {
}
Ok(())
}
fn get_buf_for_write(&self, blkno: u32) -> Result<page_cache::PageWriteGuard, Error> {
// Look up the right page
let cache = page_cache::get();
let mut write_guard = match cache.write_ephemeral_buf(self.file_id, blkno) {
WriteBufResult::Found(guard) => guard,
WriteBufResult::NotFound(mut guard) => {
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
self.fill_buffer(guard.deref_mut(), blkno)?;
guard.mark_valid();
// And then fall through to modify it.
guard
}
};
write_guard.mark_dirty();
Ok(write_guard)
}
}
/// Does the given filename look like an ephemeral file?
@@ -167,48 +189,49 @@ impl FileExt for EphemeralFile {
}
}
impl Write for EphemeralFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, Error> {
let n = self.write_at(buf, self.pos)?;
self.pos += n as u64;
Ok(n)
}
impl BlobWriter for EphemeralFile {
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
let pos = self.size;
fn flush(&mut self) -> Result<(), std::io::Error> {
// we don't need to flush data:
// * we either write input bytes or not, not keeping any intermediate data buffered
// * rust unix file `flush` impl does not flush things either, returning `Ok(())`
Ok(())
}
}
let mut blknum = (self.size / PAGE_SZ as u64) as u32;
let mut off = (pos % PAGE_SZ as u64) as usize;
impl Seek for EphemeralFile {
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match pos {
SeekFrom::Start(offset) => {
self.pos = offset;
}
SeekFrom::End(_offset) => {
return Err(Error::new(
ErrorKind::Other,
"SeekFrom::End not supported by EphemeralFile",
));
}
SeekFrom::Current(offset) => {
let pos = self.pos as i128 + offset as i128;
if pos < 0 {
return Err(Error::new(
ErrorKind::InvalidInput,
"offset would be negative",
));
}
if pos > u64::MAX as i128 {
return Err(Error::new(ErrorKind::InvalidInput, "offset overflow"));
}
self.pos = pos as u64;
}
let mut buf = self.get_buf_for_write(blknum)?;
// Write the length field
let len_buf = u32::to_ne_bytes(srcbuf.len() as u32);
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it needs to be split across pages
buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]);
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]);
off = 4 - thislen;
} else {
buf[off..off + 4].copy_from_slice(&len_buf);
off += 4;
}
Ok(self.pos)
// Write the payload
let mut buf_remain = srcbuf;
while !buf_remain.is_empty() {
let mut page_remain = PAGE_SZ - off;
if page_remain == 0 {
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
off = 0;
page_remain = PAGE_SZ;
}
let this_blk_len = min(page_remain, buf_remain.len());
buf[off..(off + this_blk_len)].copy_from_slice(&buf_remain[..this_blk_len]);
off += this_blk_len;
buf_remain = &buf_remain[this_blk_len..];
}
drop(buf);
self.size += 4 + srcbuf.len() as u64;
Ok(pos)
}
}
@@ -239,11 +262,34 @@ pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Er
}
}
impl BlockReader for EphemeralFile {
type BlockLease = page_cache::PageReadGuard<'static>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
// Look up the right page
let cache = page_cache::get();
loop {
match cache.read_ephemeral_buf(self.file_id, blknum) {
ReadBufResult::Found(guard) => return Ok(guard),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
write_guard.mark_valid();
// Swap for read lock
continue;
}
};
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use rand::seq::SliceRandom;
use rand::thread_rng;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter};
use crate::layered_repository::block_io::BlockCursor;
use rand::{seq::SliceRandom, thread_rng, RngCore};
use std::fs;
use std::str::FromStr;
@@ -281,19 +327,19 @@ mod tests {
fn test_ephemeral_files() -> Result<(), Error> {
let (conf, tenantid, timelineid) = repo_harness("ephemeral_files")?;
let mut file_a = EphemeralFile::create(conf, tenantid, timelineid)?;
let file_a = EphemeralFile::create(conf, tenantid, timelineid)?;
file_a.write_all(b"foo")?;
file_a.write_all_at(b"foo", 0)?;
assert_eq!("foo", read_string(&file_a, 0, 20)?);
file_a.write_all(b"bar")?;
file_a.write_all_at(b"bar", 3)?;
assert_eq!("foobar", read_string(&file_a, 0, 20)?);
// Open a lot of files, enough to cause some page evictions.
let mut efiles = Vec::new();
for fileno in 0..100 {
let mut efile = EphemeralFile::create(conf, tenantid, timelineid)?;
efile.write_all(format!("file {}", fileno).as_bytes())?;
let efile = EphemeralFile::create(conf, tenantid, timelineid)?;
efile.write_all_at(format!("file {}", fileno).as_bytes(), 0)?;
assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?);
efiles.push((fileno, efile));
}
@@ -307,4 +353,41 @@ mod tests {
Ok(())
}
#[test]
fn test_ephemeral_blobs() -> Result<(), Error> {
let (conf, tenantid, timelineid) = repo_harness("ephemeral_blobs")?;
let mut file = EphemeralFile::create(conf, tenantid, timelineid)?;
let pos_foo = file.write_blob(b"foo")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
let pos_bar = file.write_blob(b"bar")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());
assert_eq!(b"bar", file.block_cursor().read_blob(pos_bar)?.as_slice());
let mut blobs = Vec::new();
for i in 0..10000 {
let data = Vec::from(format!("blob{}", i).as_bytes());
let pos = file.write_blob(&data)?;
blobs.push((pos, data));
}
let mut cursor = BlockCursor::new(&file);
for (pos, expected) in blobs {
let actual = cursor.read_blob(pos)?;
assert_eq!(actual, expected);
}
drop(cursor);
// Test a large blob that spans multiple pages
let mut large_data = Vec::new();
large_data.resize(20000, 0);
thread_rng().fill_bytes(&mut large_data);
let pos_large = file.write_blob(&large_data)?;
let result = file.block_cursor().read_blob(pos_large)?;
assert_eq!(result, large_data);
Ok(())
}
}
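
The `write_blob` implementation above stores each blob as a 4-byte length followed by the payload, and either part may straddle a page boundary. The sketch below shows the same layout on a plain in-memory page vector. `PagedBuf` and its methods are hypothetical stand-ins (there is no page cache here), `PAGE_SZ` is assumed to be 8192, and the length is written little-endian for determinism, whereas the code in this change uses native byte order.

```rust
const PAGE_SZ: usize = 8192; // assumed page size for this sketch

/// In-memory stand-in for a paged file; not the real EphemeralFile.
#[derive(Default)]
struct PagedBuf {
    pages: Vec<[u8; PAGE_SZ]>,
    size: u64,
}

impl PagedBuf {
    /// Copy `src` to byte offset `pos`, allocating pages as needed and
    /// continuing on the next page whenever the current one is full.
    fn write_at(&mut self, mut pos: u64, mut src: &[u8]) {
        while !src.is_empty() {
            let blknum = (pos / PAGE_SZ as u64) as usize;
            let off = (pos % PAGE_SZ as u64) as usize;
            while self.pages.len() <= blknum {
                self.pages.push([0u8; PAGE_SZ]);
            }
            let n = src.len().min(PAGE_SZ - off);
            self.pages[blknum][off..off + n].copy_from_slice(&src[..n]);
            pos += n as u64;
            src = &src[n..];
        }
    }

    /// Fill `dst` from byte offset `pos`. Panics if the range was never written.
    fn read_at(&self, mut pos: u64, dst: &mut [u8]) {
        let mut done = 0;
        while done < dst.len() {
            let blknum = (pos / PAGE_SZ as u64) as usize;
            let off = (pos % PAGE_SZ as u64) as usize;
            let n = (dst.len() - done).min(PAGE_SZ - off);
            dst[done..done + n].copy_from_slice(&self.pages[blknum][off..off + n]);
            pos += n as u64;
            done += n;
        }
    }

    /// Append a length-prefixed blob and return the offset of its header.
    fn write_blob(&mut self, data: &[u8]) -> u64 {
        let pos = self.size;
        self.write_at(pos, &(data.len() as u32).to_le_bytes());
        self.write_at(pos + 4, data);
        self.size += 4 + data.len() as u64;
        pos
    }

    /// Read back the blob whose header starts at `pos`.
    fn read_blob(&self, pos: u64) -> Vec<u8> {
        let mut len_buf = [0u8; 4];
        self.read_at(pos, &mut len_buf);
        let mut data = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        self.read_at(pos + 4, &mut data);
        data
    }
}
```

Writing a blob of `PAGE_SZ - 6` bytes first leaves exactly two free bytes on page 0, so the next blob's length header is split across pages 0 and 1, which is the case the `thislen < 4` branch handles in the real code.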


@@ -2,29 +2,50 @@
//! Helper functions for dealing with filenames of the image and delta layer files.
//!
use crate::config::PageServerConf;
use crate::layered_repository::storage_layer::SegmentTag;
use crate::relish::*;
use crate::repository::Key;
use std::cmp::Ordering;
use std::fmt;
use std::ops::Range;
use std::path::PathBuf;
use zenith_utils::lsn::Lsn;
// Note: LayeredTimeline::load_layer_map() relies on this sort order
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct DeltaFileName {
pub seg: SegmentTag,
pub start_lsn: Lsn,
pub end_lsn: Lsn,
pub dropped: bool,
pub key_range: Range<Key>,
pub lsn_range: Range<Lsn>,
}
impl PartialOrd for DeltaFileName {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for DeltaFileName {
fn cmp(&self, other: &Self) -> Ordering {
let mut cmp = self.key_range.start.cmp(&other.key_range.start);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.key_range.end.cmp(&other.key_range.end);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.lsn_range.start.cmp(&other.lsn_range.start);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.lsn_range.end.cmp(&other.lsn_range.end);
cmp
}
}
/// Represents the filename of a DeltaLayer
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>
///
/// or if it was dropped:
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<start LSN>_<end LSN>_DROPPED
/// <key start>-<key end>__<LSN start>-<LSN end>
///
impl DeltaFileName {
///
@@ -32,234 +53,121 @@ impl DeltaFileName {
/// match the expected pattern.
///
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
parts = rest.split('_');
rel = RelishTag::Relation(RelTag {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
relnode: parts.next()?.parse::<u32>().ok()?,
forknum: parts.next()?.parse::<u8>().ok()?,
});
} else if let Some(rest) = fname.strip_prefix("pg_xact_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::Clog,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_members_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_offsets_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_filenodemap_") {
parts = rest.split('_');
rel = RelishTag::FileNodeMap {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_twophase_") {
parts = rest.split('_');
rel = RelishTag::TwoPhase {
xid: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_control_checkpoint_") {
parts = rest.split('_');
rel = RelishTag::Checkpoint;
} else if let Some(rest) = fname.strip_prefix("pg_control_") {
parts = rest.split('_');
rel = RelishTag::ControlFile;
} else {
let mut parts = fname.split("__");
let mut key_parts = parts.next()?.split('-');
let mut lsn_parts = parts.next()?.split('-');
let key_start_str = key_parts.next()?;
let key_end_str = key_parts.next()?;
let lsn_start_str = lsn_parts.next()?;
let lsn_end_str = lsn_parts.next()?;
if parts.next().is_some() || key_parts.next().is_some() || lsn_parts.next().is_some() {
return None;
}
let segno = parts.next()?.parse::<u32>().ok()?;
let key_start = Key::from_hex(key_start_str).ok()?;
let key_end = Key::from_hex(key_end_str).ok()?;
let seg = SegmentTag { rel, segno };
let start_lsn = Lsn::from_hex(lsn_start_str).ok()?;
let end_lsn = Lsn::from_hex(lsn_end_str).ok()?;
let start_lsn = Lsn::from_hex(parts.next()?).ok()?;
let end_lsn = Lsn::from_hex(parts.next()?).ok()?;
let mut dropped = false;
if let Some(suffix) = parts.next() {
if suffix == "DROPPED" {
dropped = true;
} else {
return None;
}
}
if parts.next().is_some() {
if start_lsn >= end_lsn {
return None;
// or panic?
}
if key_start >= key_end {
return None;
// or panic?
}
Some(DeltaFileName {
seg,
start_lsn,
end_lsn,
dropped,
key_range: key_start..key_end,
lsn_range: start_lsn..end_lsn,
})
}
}
impl fmt::Display for DeltaFileName {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let basename = match self.seg.rel {
RelishTag::Relation(reltag) => format!(
"rel_{}_{}_{}_{}",
reltag.spcnode, reltag.dbnode, reltag.relnode, reltag.forknum
),
RelishTag::Slru {
slru: SlruKind::Clog,
segno,
} => format!("pg_xact_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno,
} => format!("pg_multixact_members_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno,
} => format!("pg_multixact_offsets_{:04X}", segno),
RelishTag::FileNodeMap { spcnode, dbnode } => {
format!("pg_filenodemap_{}_{}", spcnode, dbnode)
}
RelishTag::TwoPhase { xid } => format!("pg_twophase_{}", xid),
RelishTag::Checkpoint => "pg_control_checkpoint".to_string(),
RelishTag::ControlFile => "pg_control".to_string(),
};
write!(
f,
"{}_{}_{:016X}_{:016X}{}",
basename,
self.seg.segno,
u64::from(self.start_lsn),
u64::from(self.end_lsn),
if self.dropped { "_DROPPED" } else { "" }
"{}-{}__{:016X}-{:016X}",
self.key_range.start,
self.key_range.end,
u64::from(self.lsn_range.start),
u64::from(self.lsn_range.end),
)
}
}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct ImageFileName {
pub seg: SegmentTag,
pub key_range: Range<Key>,
pub lsn: Lsn,
}
impl PartialOrd for ImageFileName {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for ImageFileName {
fn cmp(&self, other: &Self) -> Ordering {
let mut cmp = self.key_range.start.cmp(&other.key_range.start);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.key_range.end.cmp(&other.key_range.end);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.lsn.cmp(&other.lsn);
cmp
}
}
///
/// Represents the filename of an ImageLayer
///
/// <spcnode>_<dbnode>_<relnode>_<forknum>_<seg>_<LSN>
///
/// <key start>-<key end>__<LSN>
impl ImageFileName {
///
/// Parse a string as an image file name. Returns None if the filename does not
/// match the expected pattern.
///
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
parts = rest.split('_');
rel = RelishTag::Relation(RelTag {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
relnode: parts.next()?.parse::<u32>().ok()?,
forknum: parts.next()?.parse::<u8>().ok()?,
});
} else if let Some(rest) = fname.strip_prefix("pg_xact_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::Clog,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_members_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_multixact_offsets_") {
parts = rest.split('_');
rel = RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno: u32::from_str_radix(parts.next()?, 16).ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_filenodemap_") {
parts = rest.split('_');
rel = RelishTag::FileNodeMap {
spcnode: parts.next()?.parse::<u32>().ok()?,
dbnode: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_twophase_") {
parts = rest.split('_');
rel = RelishTag::TwoPhase {
xid: parts.next()?.parse::<u32>().ok()?,
};
} else if let Some(rest) = fname.strip_prefix("pg_control_checkpoint_") {
parts = rest.split('_');
rel = RelishTag::Checkpoint;
} else if let Some(rest) = fname.strip_prefix("pg_control_") {
parts = rest.split('_');
rel = RelishTag::ControlFile;
} else {
let mut parts = fname.split("__");
let mut key_parts = parts.next()?.split('-');
let key_start_str = key_parts.next()?;
let key_end_str = key_parts.next()?;
let lsn_str = parts.next()?;
if parts.next().is_some() || key_parts.next().is_some() {
return None;
}
let segno = parts.next()?.parse::<u32>().ok()?;
let key_start = Key::from_hex(key_start_str).ok()?;
let key_end = Key::from_hex(key_end_str).ok()?;
let seg = SegmentTag { rel, segno };
let lsn = Lsn::from_hex(lsn_str).ok()?;
let lsn = Lsn::from_hex(parts.next()?).ok()?;
if parts.next().is_some() {
return None;
}
Some(ImageFileName { seg, lsn })
Some(ImageFileName {
key_range: key_start..key_end,
lsn,
})
}
}
impl fmt::Display for ImageFileName {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let basename = match self.seg.rel {
RelishTag::Relation(reltag) => format!(
"rel_{}_{}_{}_{}",
reltag.spcnode, reltag.dbnode, reltag.relnode, reltag.forknum
),
RelishTag::Slru {
slru: SlruKind::Clog,
segno,
} => format!("pg_xact_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno,
} => format!("pg_multixact_members_{:04X}", segno),
RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno,
} => format!("pg_multixact_offsets_{:04X}", segno),
RelishTag::FileNodeMap { spcnode, dbnode } => {
format!("pg_filenodemap_{}_{}", spcnode, dbnode)
}
RelishTag::TwoPhase { xid } => format!("pg_twophase_{}", xid),
RelishTag::Checkpoint => "pg_control_checkpoint".to_string(),
RelishTag::ControlFile => "pg_control".to_string(),
};
write!(
f,
"{}_{}_{:016X}",
basename,
self.seg.segno,
"{}-{}__{:016X}",
self.key_range.start,
self.key_range.end,
u64::from(self.lsn),
)
}
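
The new layer file names follow the pattern `<key start>-<key end>__<LSN start>-<LSN end>` for delta layers. As a rough illustration of the parse and format round trip, here is a simplified sketch. `DeltaName`, `parse_delta_name` and `format_delta_name` are hypothetical; keys are kept as raw hex strings instead of the repository's `Key` type, and LSNs as plain `u64`.

```rust
use std::ops::Range;

/// Simplified stand-in for a delta layer file name (not the real DeltaFileName).
#[derive(Debug, PartialEq)]
struct DeltaName {
    key_range: Range<String>, // hex strings, assumed equal length and upper case
    lsn_range: Range<u64>,
}

/// Parse `<key start>-<key end>__<LSN start>-<LSN end>`.
/// Returns None if the name does not match the pattern.
fn parse_delta_name(fname: &str) -> Option<DeltaName> {
    let mut parts = fname.split("__");
    let mut key_parts = parts.next()?.split('-');
    let mut lsn_parts = parts.next()?.split('-');

    let key_start = key_parts.next()?;
    let key_end = key_parts.next()?;
    let lsn_start = u64::from_str_radix(lsn_parts.next()?, 16).ok()?;
    let lsn_end = u64::from_str_radix(lsn_parts.next()?, 16).ok()?;

    // Reject trailing components on any of the three iterators.
    if parts.next().is_some() || key_parts.next().is_some() || lsn_parts.next().is_some() {
        return None;
    }

    let is_hex = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_hexdigit());
    if !is_hex(key_start) || !is_hex(key_end) || key_start.len() != key_end.len() {
        return None;
    }
    // Equal-length hex strings in one case order the same way as the numbers.
    if key_start >= key_end || lsn_start >= lsn_end {
        return None;
    }

    Some(DeltaName {
        key_range: key_start.to_string()..key_end.to_string(),
        lsn_range: lsn_start..lsn_end,
    })
}

/// Inverse of `parse_delta_name`, using 16 upper-case hex digits per LSN.
fn format_delta_name(name: &DeltaName) -> String {
    format!(
        "{}-{}__{:016X}-{:016X}",
        name.key_range.start, name.key_range.end, name.lsn_range.start, name.lsn_range.end
    )
}
```

Checking all three iterators for leftovers matters: a name with an extra `-<LSN>` component at the end would otherwise be accepted with its tail silently ignored.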


@@ -1,142 +0,0 @@
//!
//! Global registry of open layers.
//!
//! Whenever a new in-memory layer is created to hold incoming WAL, it is registered
//! in [`GLOBAL_LAYER_MAP`], so that we can keep track of the total number of
//! in-memory layers in the system, and know when we need to evict some to release
//! memory.
//!
//! Each layer is assigned a unique ID when it's registered in the global registry.
//! The ID can be used to relocate the layer later, without having to hold locks.
//!
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::{Arc, RwLock};
use super::inmemory_layer::InMemoryLayer;
use lazy_static::lazy_static;
const MAX_USAGE_COUNT: u8 = 5;
lazy_static! {
pub static ref GLOBAL_LAYER_MAP: RwLock<InMemoryLayers> =
RwLock::new(InMemoryLayers::default());
}
// TODO these types can probably be smaller
#[derive(PartialEq, Eq, Clone, Copy)]
pub struct LayerId {
index: usize,
tag: u64, // to avoid ABA problem
}
enum SlotData {
Occupied(Arc<InMemoryLayer>),
/// Vacant slots form a linked list, the value is the index
/// of the next vacant slot in the list.
Vacant(Option<usize>),
}
struct Slot {
tag: u64,
data: SlotData,
usage_count: AtomicU8, // for clock algorithm
}
#[derive(Default)]
pub struct InMemoryLayers {
slots: Vec<Slot>,
num_occupied: usize,
// Head of free-slot list.
next_empty_slot_idx: Option<usize>,
}
impl InMemoryLayers {
pub fn insert(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
let slot_idx = match self.next_empty_slot_idx {
Some(slot_idx) => slot_idx,
None => {
let idx = self.slots.len();
self.slots.push(Slot {
tag: 0,
data: SlotData::Vacant(None),
usage_count: AtomicU8::new(0),
});
idx
}
};
let slots_len = self.slots.len();
let slot = &mut self.slots[slot_idx];
match slot.data {
SlotData::Occupied(_) => {
panic!("an occupied slot was in the free list");
}
SlotData::Vacant(next_empty_slot_idx) => {
self.next_empty_slot_idx = next_empty_slot_idx;
}
}
slot.data = SlotData::Occupied(layer);
slot.usage_count.store(1, Ordering::Relaxed);
self.num_occupied += 1;
assert!(self.num_occupied <= slots_len);
LayerId {
index: slot_idx,
tag: slot.tag,
}
}
pub fn get(&self, layer_id: &LayerId) -> Option<Arc<InMemoryLayer>> {
let slot = self.slots.get(layer_id.index)?; // TODO should out of bounds indexes just panic?
if slot.tag != layer_id.tag {
return None;
}
if let SlotData::Occupied(layer) = &slot.data {
let _ = slot.usage_count.fetch_update(
Ordering::Relaxed,
Ordering::Relaxed,
|old_usage_count| {
if old_usage_count < MAX_USAGE_COUNT {
Some(old_usage_count + 1)
} else {
None
}
},
);
Some(Arc::clone(layer))
} else {
None
}
}
// TODO this won't be a public API in the future
pub fn remove(&mut self, layer_id: &LayerId) {
let slot = &mut self.slots[layer_id.index];
if slot.tag != layer_id.tag {
return;
}
match &slot.data {
SlotData::Occupied(_layer) => {
// TODO evict the layer
}
SlotData::Vacant(_) => unimplemented!(),
}
slot.data = SlotData::Vacant(self.next_empty_slot_idx);
self.next_empty_slot_idx = Some(layer_id.index);
assert!(self.num_occupied > 0);
self.num_occupied -= 1;
slot.tag = slot.tag.wrapping_add(1);
}
}
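The tagged-slot scheme above can be illustrated with a minimal, self-contained sketch. It is not the pageserver's code: `SlotMap`, `Id`, `Data`, `free_head` and the `String` payload are hypothetical stand-ins, and the usage counter for the clock algorithm is left out. It only shows how a per-slot tag lets a stale ID be rejected after its slot has been reused.

```rust
// Hypothetical, simplified stand-in for the slot map above. All names here
// are illustrative; the payload is a String instead of an Arc<InMemoryLayer>.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Id {
    index: usize,
    tag: u64, // generation counter, guards against the ABA problem
}

enum Data {
    Occupied(String),
    Vacant(Option<usize>), // index of the next vacant slot (free list)
}

struct Slot {
    tag: u64,
    data: Data,
}

#[derive(Default)]
pub struct SlotMap {
    slots: Vec<Slot>,
    free_head: Option<usize>,
}

impl SlotMap {
    pub fn insert(&mut self, value: String) -> Id {
        // Reuse a vacant slot if there is one, otherwise grow the vector.
        let index = match self.free_head {
            Some(i) => i,
            None => {
                self.slots.push(Slot { tag: 0, data: Data::Vacant(None) });
                self.slots.len() - 1
            }
        };
        let slot = &mut self.slots[index];
        // Pop the slot off the free list before occupying it.
        if let Data::Vacant(next) = slot.data {
            self.free_head = next;
        }
        slot.data = Data::Occupied(value);
        Id { index, tag: slot.tag }
    }

    pub fn get(&self, id: Id) -> Option<&str> {
        let slot = self.slots.get(id.index)?;
        match &slot.data {
            Data::Occupied(value) if slot.tag == id.tag => Some(value.as_str()),
            _ => None,
        }
    }

    pub fn remove(&mut self, id: Id) -> Option<String> {
        let slot = self.slots.get_mut(id.index)?;
        if slot.tag != id.tag || matches!(slot.data, Data::Vacant(_)) {
            return None; // stale or already-removed id
        }
        // Push the slot onto the free list and bump its tag, so that any
        // outstanding copy of `id` no longer matches.
        let old = std::mem::replace(&mut slot.data, Data::Vacant(self.free_head));
        self.free_head = Some(id.index);
        slot.tag = slot.tag.wrapping_add(1);
        match old {
            Data::Occupied(value) => Some(value),
            Data::Vacant(_) => None,
        }
    }
}
```

In this sketch, inserting a value, removing it, and inserting a second value reuses the same slot index, yet a lookup with the first ID returns `None` because the tag was bumped on removal.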


@@ -1,86 +1,96 @@
//! An ImageLayer represents an image or a snapshot of a segment at one particular LSN.
//! It is stored in a file on disk.
//! An ImageLayer represents an image or a snapshot of a key-range at
//! one particular LSN. It contains an image of all key-value pairs
//! in its key-range. Any key that falls into the image layer's range
//! but does not exist in the layer, does not exist.
//!
//! On disk, the image files are stored in timelines/<timelineid> directory.
//! Currently, there are no subdirectories, and each image layer file is named like this:
//! An image layer is stored in a file on disk. The file is stored in
//! timelines/<timelineid> directory. Currently, there are no
//! subdirectories, and each image layer file is named like this:
//!
//! Note that segno is
//! <spcnode>_<dbnode>_<relnode>_<forknum>_<segno>_<LSN>
//! <key start>-<key end>__<LSN>
//!
//! For example:
//!
//! 1663_13990_2609_0_5_000000000169C348
//!
//! An image file is constructed using the 'bookfile' crate.
//!
//! Only metadata is loaded into memory by the load function.
//! When images are needed, they are read directly from disk.
//!
//! For blocky relishes, the images are stored in BLOCKY_IMAGES_CHAPTER.
//! All the images are required to be BLOCK_SIZE, which allows for random access.
//!
//! For non-blocky relishes, the image can be found in NONBLOCKY_IMAGE_CHAPTER.
//! 000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568
//!
//! Every image layer file consists of three parts: "summary",
//! "index", and "values". The summary is a fixed size header at the
//! beginning of the file, and it contains basic information about the
//! layer, and offsets to the other parts. The "index" is a B-tree,
//! mapping from Key to an offset in the "values" part. The
//! actual page images are stored in the "values" part.
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter};
use crate::layered_repository::block_io::{BlockBuf, BlockReader, FileBlockReader};
use crate::layered_repository::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection};
use crate::layered_repository::filename::{ImageFileName, PathOrConf};
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, SegmentBlk, SegmentTag,
Layer, ValueReconstructResult, ValueReconstructState,
};
use crate::layered_repository::RELISH_SEG_SIZE;
use crate::page_cache::PAGE_SZ;
use crate::repository::{Key, Value, KEY_SIZE};
use crate::virtual_file::VirtualFile;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{anyhow, bail, ensure, Context, Result};
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use log::*;
use hex;
use serde::{Deserialize, Serialize};
use std::convert::TryInto;
use std::fs;
use std::io::{BufWriter, Write};
use std::io::Write;
use std::io::{Seek, SeekFrom};
use std::ops::Range;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter, ChapterWriter};
use std::sync::{RwLock, RwLockReadGuard};
use tracing::*;
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
// Magic constant to identify a Zenith segment image file
pub const IMAGE_FILE_MAGIC: u32 = 0x5A616E01 + 1;
/// Contains each block in block # order
const BLOCKY_IMAGES_CHAPTER: u64 = 1;
const NONBLOCKY_IMAGE_CHAPTER: u64 = 2;
/// Contains the [`Summary`] struct
const SUMMARY_CHAPTER: u64 = 3;
///
/// Header stored in the beginning of the file
///
/// After this comes the 'values' part, starting on block 1. After that,
/// the 'index' starts at the block indicated by 'index_start_blk'
///
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
struct Summary {
/// Magic value to identify this as a zenith image file. Always IMAGE_FILE_MAGIC.
magic: u16,
format_version: u16,
tenantid: ZTenantId,
timelineid: ZTimelineId,
seg: SegmentTag,
key_range: Range<Key>,
lsn: Lsn,
/// Block number where the 'index' part of the file begins.
index_start_blk: u32,
/// Block within the 'index', where the B-tree root page is stored
index_root_blk: u32,
// the 'values' part starts after the summary header, on block 1.
}
impl From<&ImageLayer> for Summary {
fn from(layer: &ImageLayer) -> Self {
Self {
magic: IMAGE_FILE_MAGIC,
format_version: STORAGE_FORMAT_VERSION,
tenantid: layer.tenantid,
timelineid: layer.timelineid,
seg: layer.seg,
key_range: layer.key_range.clone(),
lsn: layer.lsn,
index_start_blk: 0,
index_root_blk: 0,
}
}
}
const BLOCK_SIZE: usize = 8192;
///
/// ImageLayer is the in-memory data structure associated with an on-disk image
/// file. We keep an ImageLayer in memory for each file, in the LayerMap. If a
/// layer is in "loaded" state, we have a copy of the file in memory, in 'inner'.
/// layer is in "loaded" state, we have a copy of the index in memory, in 'inner'.
/// Otherwise the struct is just a placeholder for a file that exists on disk,
/// and it needs to be loaded before using it in queries.
///
@@ -88,26 +98,24 @@ pub struct ImageLayer {
path_or_conf: PathOrConf,
pub tenantid: ZTenantId,
pub timelineid: ZTimelineId,
pub seg: SegmentTag,
pub key_range: Range<Key>,
// This entry contains an image of all pages as of this LSN
pub lsn: Lsn,
inner: Mutex<ImageLayerInner>,
}
#[derive(Clone)]
enum ImageType {
Blocky { num_blocks: SegmentBlk },
NonBlocky,
inner: RwLock<ImageLayerInner>,
}
pub struct ImageLayerInner {
/// If None, the 'image_type' has not been loaded into memory yet.
book: Option<Book<VirtualFile>>,
/// If false, the 'index' has not been loaded into memory yet.
loaded: bool,
/// Derived from filename and bookfile chapter metadata
image_type: ImageType,
// values copied from summary
index_start_blk: u32,
index_root_blk: u32,
/// Reader object for reading blocks from the file. (None if not loaded yet)
file: Option<FileBlockReader<VirtualFile>>,
}
impl Layer for ImageLayer {
@@ -123,99 +131,51 @@ impl Layer for ImageLayer {
self.timelineid
}
fn get_seg_tag(&self) -> SegmentTag {
self.seg
fn get_key_range(&self) -> Range<Key> {
self.key_range.clone()
}
fn is_dropped(&self) -> bool {
false
}
fn get_start_lsn(&self) -> Lsn {
self.lsn
}
fn get_end_lsn(&self) -> Lsn {
fn get_lsn_range(&self) -> Range<Lsn> {
// End-bound is exclusive
self.lsn + 1
self.lsn..(self.lsn + 1)
}
/// Look up given page in the file
fn get_page_reconstruct_data(
fn get_value_reconstruct_data(
&self,
blknum: SegmentBlk,
lsn: Lsn,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
assert!((0..RELISH_SEG_SIZE).contains(&blknum));
assert!(lsn >= self.lsn);
match reconstruct_data.page_img {
Some((cached_lsn, _)) if self.lsn <= cached_lsn => {
return Ok(PageReconstructResult::Complete)
}
_ => {}
}
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
) -> anyhow::Result<ValueReconstructResult> {
assert!(self.key_range.contains(&key));
assert!(lsn_range.end >= self.lsn);
let inner = self.load()?;
let buf = match &inner.image_type {
ImageType::Blocky { num_blocks } => {
// Check if the request is beyond EOF
if blknum >= *num_blocks {
return Ok(PageReconstructResult::Missing(lsn));
}
let file = inner.file.as_ref().unwrap();
let tree_reader = DiskBtreeReader::new(inner.index_start_blk, inner.index_root_blk, file);
let mut buf = vec![0u8; BLOCK_SIZE];
let offset = BLOCK_SIZE as u64 * blknum as u64;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
if let Some(offset) = tree_reader.get(&keybuf)? {
let blob = file.block_cursor().read_blob(offset).with_context(|| {
format!(
"failed to read value from data file {} at offset {}",
self.filename().display(),
offset
)
})?;
let value = Bytes::from(blob);
let chapter = inner
.book
.as_ref()
.unwrap()
.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
chapter.read_exact_at(&mut buf, offset).with_context(|| {
format!(
"failed to read page from data file {} at offset {}",
self.filename().display(),
offset
)
})?;
buf
}
ImageType::NonBlocky => {
ensure!(blknum == 0);
inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?
.into_vec()
}
};
reconstruct_data.page_img = Some((self.lsn, Bytes::from(buf)));
Ok(PageReconstructResult::Complete)
}
/// Get size of the segment
fn get_seg_size(&self, _lsn: Lsn) -> Result<SegmentBlk> {
let inner = self.load()?;
match inner.image_type {
ImageType::Blocky { num_blocks } => Ok(num_blocks),
ImageType::NonBlocky => Err(anyhow!("get_seg_size called for non-blocky segment")),
reconstruct_state.img = Some((self.lsn, value));
Ok(ValueReconstructResult::Complete)
} else {
Ok(ValueReconstructResult::Missing)
}
}
/// Does this segment exist at given LSN?
fn get_seg_exists(&self, _lsn: Lsn) -> Result<bool> {
Ok(true)
}
fn unload(&self) -> Result<()> {
Ok(())
fn iter(&self) -> Box<dyn Iterator<Item = Result<(Key, Lsn, Value)>>> {
todo!();
}
fn delete(&self) -> Result<()> {
@@ -233,26 +193,28 @@ impl Layer for ImageLayer {
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
fn dump(&self, verbose: bool) -> Result<()> {
println!(
"----- image layer for ten {} tli {} seg {} at {} ----",
self.tenantid, self.timelineid, self.seg, self.lsn
"----- image layer for ten {} tli {} key {}-{} at {} ----",
self.tenantid, self.timelineid, self.key_range.start, self.key_range.end, self.lsn
);
let inner = self.load()?;
match inner.image_type {
ImageType::Blocky { num_blocks } => println!("({}) blocks ", num_blocks),
ImageType::NonBlocky => {
let chapter = inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?;
println!("non-blocky ({} bytes)", chapter.len());
}
if !verbose {
return Ok(());
}
let inner = self.load()?;
let file = inner.file.as_ref().unwrap();
let tree_reader =
DiskBtreeReader::<_, KEY_SIZE>::new(inner.index_start_blk, inner.index_root_blk, file);
tree_reader.dump()?;
tree_reader.visit(&[0u8; KEY_SIZE], VisitDirection::Forwards, |key, value| {
println!("key: {} offset {}", hex::encode(key), value);
true
})?;
Ok(())
}
}
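The lookup path in `get_value_reconstruct_data` (serialize the key into a fixed-size buffer, look it up in the index to get an offset, then read the blob at that offset) can be sketched in isolation. The following is a toy model under stated assumptions: `ToyImageLayer` is a hypothetical name, a `BTreeMap` stands in for the on-disk B-tree, keys are plain `u64` values, `KEY_SIZE` is set to 8 for the example only, and the 4-byte big-endian length prefix is an assumed blob encoding rather than the one used by `blob_io`.

```rust
use std::collections::BTreeMap;

// Assumption for this sketch only; the real KEY_SIZE is defined elsewhere.
const KEY_SIZE: usize = 8;

/// Hypothetical stand-in for an image layer: an index mapping a serialized
/// key to the offset of a length-prefixed blob in `values`.
#[derive(Default)]
pub struct ToyImageLayer {
    index: BTreeMap<[u8; KEY_SIZE], u64>,
    values: Vec<u8>,
}

impl ToyImageLayer {
    /// Append an image and record its offset in the index.
    pub fn put_image(&mut self, key: u64, img: &[u8]) {
        let off = self.values.len() as u64;
        // Each blob is preceded by its length (assumed encoding).
        self.values.extend_from_slice(&(img.len() as u32).to_be_bytes());
        self.values.extend_from_slice(img);
        // Big-endian bytes sort in the same order as the numeric keys.
        self.index.insert(key.to_be_bytes(), off);
    }

    /// Look up the offset in the index, then read the blob stored there.
    pub fn get(&self, key: u64) -> Option<&[u8]> {
        let off = *self.index.get(&key.to_be_bytes())? as usize;
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(self.values.get(off..off + 4)?);
        let len = u32::from_be_bytes(len_bytes) as usize;
        self.values.get(off + 4..off + 4 + len)
    }
}
```

A key that is absent from the index yields `None`, which corresponds to the `ValueReconstructResult::Missing` branch in the code above.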
@@ -273,32 +235,55 @@ impl ImageLayer {
}
///
/// Load the contents of the file into memory
/// Open the underlying file and read the metadata into memory, if it's
/// not loaded already.
///
fn load(&self) -> Result<MutexGuard<ImageLayerInner>> {
// quick exit if already loaded
let mut inner = self.inner.lock().unwrap();
fn load(&self) -> Result<RwLockReadGuard<ImageLayerInner>> {
loop {
// Quick exit if already loaded
let inner = self.inner.read().unwrap();
if inner.loaded {
return Ok(inner);
}
if inner.book.is_some() {
return Ok(inner);
// Need to open the file and load the metadata. Upgrade our lock to
// a write lock. (Or rather, release and re-lock in write mode.)
drop(inner);
let mut inner = self.inner.write().unwrap();
if !inner.loaded {
self.load_inner(&mut inner)?;
} else {
// Another thread loaded it while we were not holding the lock.
}
// We now have the file open and loaded. The std library RwLock has no
// function to downgrade a write lock to a read lock, so we have to
// release and re-lock in read mode. While we do that, another thread
// could unload again, so we have to re-check and retry if that happens.
drop(inner);
}
}
fn load_inner(&self, inner: &mut ImageLayerInner) -> Result<()> {
let path = self.path();
let file = VirtualFile::open(&path)
.with_context(|| format!("Failed to open virtual file '{}'", path.display()))?;
let book = Book::new(file).with_context(|| {
format!(
"Failed to open virtual file '{}' as a bookfile",
path.display()
)
})?;
// Open the file if it's not open already.
if inner.file.is_none() {
let file = VirtualFile::open(&path)
.with_context(|| format!("Failed to open file '{}'", path.display()))?;
inner.file = Some(FileBlockReader::new(file));
}
let file = inner.file.as_mut().unwrap();
let summary_blk = file.read_blk(0)?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
match &self.path_or_conf {
PathOrConf::Conf(_) => {
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let actual_summary = Summary::des(&chapter)?;
let expected_summary = Summary::from(self);
let mut expected_summary = Summary::from(self);
expected_summary.index_start_blk = actual_summary.index_start_blk;
expected_summary.index_root_blk = actual_summary.index_root_blk;
if actual_summary != expected_summary {
bail!("in-file summary does not match expected summary. actual = {:?} expected = {:?}", actual_summary, expected_summary);
@@ -318,25 +303,10 @@ impl ImageLayer {
}
}
let image_type = if self.seg.rel.is_blocky() {
let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
let images_len = chapter.len();
ensure!(images_len % BLOCK_SIZE as u64 == 0);
let num_blocks: SegmentBlk = (images_len / BLOCK_SIZE as u64).try_into()?;
ImageType::Blocky { num_blocks }
} else {
let _chapter = book.chapter_reader(NONBLOCKY_IMAGE_CHAPTER)?;
ImageType::NonBlocky
};
debug!("loaded from {}", &path.display());
*inner = ImageLayerInner {
book: Some(book),
image_type,
};
Ok(inner)
inner.index_start_blk = actual_summary.index_start_blk;
inner.index_root_blk = actual_summary.index_root_blk;
inner.loaded = true;
Ok(())
}
/// Create an ImageLayer struct representing an existing file on disk
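The locking pattern used by `load` (try under a read lock, re-lock in write mode to do the work, then loop to re-acquire the read lock) can be shown with a small standalone sketch. `Lazy`, `State` and the `loads` counter are hypothetical names introduced for illustration; the "load" here just stores a constant where the real code reads the summary block.

```rust
use std::sync::{RwLock, RwLockReadGuard};

/// Hypothetical lazily loaded state.
#[derive(Default)]
pub struct State {
    pub loaded: bool,
    pub value: u32,
    pub loads: u32, // how many times the expensive load ran
}

#[derive(Default)]
pub struct Lazy {
    inner: RwLock<State>,
}

impl Lazy {
    pub fn load(&self) -> RwLockReadGuard<'_, State> {
        loop {
            // Fast path: already loaded, hand out the read guard.
            let inner = self.inner.read().unwrap();
            if inner.loaded {
                return inner;
            }
            // std's RwLock cannot upgrade a read guard, so release it and
            // re-acquire the lock in write mode.
            drop(inner);
            let mut inner = self.inner.write().unwrap();
            // Re-check: another thread may have loaded while we were unlocked.
            if !inner.loaded {
                inner.value = 42; // stand-in for reading the file summary
                inner.loads += 1;
                inner.loaded = true;
            }
            // There is no downgrade either: drop the write guard and loop
            // around to take the read lock again.
            drop(inner);
        }
    }
}
```

Because the flag is re-checked under the write lock, several threads calling `load` concurrently should trigger the expensive step only once in this sketch.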
@@ -350,11 +320,13 @@ impl ImageLayer {
path_or_conf: PathOrConf::Conf(conf),
timelineid,
tenantid,
seg: filename.seg,
key_range: filename.key_range.clone(),
lsn: filename.lsn,
inner: Mutex::new(ImageLayerInner {
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
inner: RwLock::new(ImageLayerInner {
loaded: false,
file: None,
index_start_blk: 0,
index_root_blk: 0,
}),
}
}
@@ -362,29 +334,33 @@ impl ImageLayer {
/// Create an ImageLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'dump_layerfile' binary.
pub fn new_for_path<F>(path: &Path, book: &Book<F>) -> Result<ImageLayer>
pub fn new_for_path<F>(path: &Path, file: F) -> Result<ImageLayer>
where
F: std::os::unix::prelude::FileExt,
{
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let summary = Summary::des(&chapter)?;
let mut summary_buf = Vec::new();
summary_buf.resize(PAGE_SZ, 0);
file.read_exact_at(&mut summary_buf, 0)?;
let summary = Summary::des_prefix(&summary_buf)?;
Ok(ImageLayer {
path_or_conf: PathOrConf::Path(path.to_path_buf()),
timelineid: summary.timelineid,
tenantid: summary.tenantid,
seg: summary.seg,
key_range: summary.key_range,
lsn: summary.lsn,
inner: Mutex::new(ImageLayerInner {
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
inner: RwLock::new(ImageLayerInner {
file: None,
loaded: false,
index_start_blk: 0,
index_root_blk: 0,
}),
})
}
fn layer_name(&self) -> ImageFileName {
ImageFileName {
seg: self.seg,
key_range: self.key_range.clone(),
lsn: self.lsn,
}
}
@@ -406,22 +382,21 @@ impl ImageLayer {
///
/// 1. Create the ImageLayerWriter by calling ImageLayerWriter::new(...)
///
/// 2. Write the contents by calling `put_page_image` for every page
/// in the segment.
/// 2. Write the contents by calling `put_image` for every key-value
/// pair in the key range.
///
/// 3. Call `finish`.
///
pub struct ImageLayerWriter {
conf: &'static PageServerConf,
_path: PathBuf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
key_range: Range<Key>,
lsn: Lsn,
num_blocks: SegmentBlk,
page_image_writer: ChapterWriter<BufWriter<VirtualFile>>,
num_blocks_written: SegmentBlk,
blob_writer: WriteBlobWriter<VirtualFile>,
tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
}
impl ImageLayerWriter {
@@ -429,10 +404,9 @@ impl ImageLayerWriter {
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
key_range: &Range<Key>,
lsn: Lsn,
num_blocks: SegmentBlk,
) -> Result<ImageLayerWriter> {
) -> anyhow::Result<ImageLayerWriter> {
// Create the file
//
// Note: This overwrites any existing file. There shouldn't be any.
@@ -441,90 +415,92 @@ impl ImageLayerWriter {
&PathOrConf::Conf(conf),
timelineid,
tenantid,
&ImageFileName { seg, lsn },
&ImageFileName {
key_range: key_range.clone(),
lsn,
},
);
let file = VirtualFile::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?;
info!("new image layer {}", path.display());
let mut file = VirtualFile::create(&path)?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64))?;
let blob_writer = WriteBlobWriter::new(file, PAGE_SZ as u64);
// Open the page-images chapter for writing. The calls to
// `put_page_image` will use this to write the contents.
let chapter = if seg.rel.is_blocky() {
book.new_chapter(BLOCKY_IMAGES_CHAPTER)
} else {
assert_eq!(num_blocks, 1);
book.new_chapter(NONBLOCKY_IMAGE_CHAPTER)
};
// Initialize the b-tree index builder
let block_buf = BlockBuf::new();
let tree_builder = DiskBtreeBuilder::new(block_buf);
let writer = ImageLayerWriter {
conf,
_path: path,
timelineid,
tenantid,
seg,
key_range: key_range.clone(),
lsn,
num_blocks,
page_image_writer: chapter,
num_blocks_written: 0,
tree: tree_builder,
blob_writer,
};
Ok(writer)
}
///
/// Write next page image to the file.
/// Write next value to the file.
///
/// The values must be appended in key order.
///
pub fn put_page_image(&mut self, block_bytes: &[u8]) -> Result<()> {
assert!(self.num_blocks_written < self.num_blocks);
if self.seg.rel.is_blocky() {
assert_eq!(block_bytes.len(), BLOCK_SIZE);
}
self.page_image_writer.write_all(block_bytes)?;
self.num_blocks_written += 1;
pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> {
ensure!(self.key_range.contains(&key));
let off = self.blob_writer.write_blob(img)?;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
self.tree.append(&keybuf, off)?;
Ok(())
}
pub fn finish(self) -> Result<ImageLayer> {
// Check that the `put_page_image' was called for every block.
assert!(self.num_blocks_written == self.num_blocks);
pub fn finish(self) -> anyhow::Result<ImageLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
// Close the page-images chapter
let book = self.page_image_writer.close()?;
let mut file = self.blob_writer.into_inner();
// Write out the summary chapter
let image_type = if self.seg.rel.is_blocky() {
ImageType::Blocky {
num_blocks: self.num_blocks,
}
} else {
ImageType::NonBlocky
};
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
// Write out the index
file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))?;
let (index_root_blk, block_buf) = self.tree.finish()?;
for buf in block_buf.blocks {
file.write_all(buf.as_ref())?;
}
// Fill in the summary on blk 0
let summary = Summary {
magic: IMAGE_FILE_MAGIC,
format_version: STORAGE_FORMAT_VERSION,
tenantid: self.tenantid,
timelineid: self.timelineid,
seg: self.seg,
key_range: self.key_range.clone(),
lsn: self.lsn,
index_start_blk,
index_root_blk,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
book.close()?;
file.seek(SeekFrom::Start(0))?;
Summary::ser_into(&summary, &mut file)?;
// Note: Because we open the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.book here. The first read will have to re-open it.
// set inner.file here. The first read will have to re-open it.
let layer = ImageLayer {
path_or_conf: PathOrConf::Conf(self.conf),
timelineid: self.timelineid,
tenantid: self.tenantid,
seg: self.seg,
key_range: self.key_range.clone(),
lsn: self.lsn,
inner: Mutex::new(ImageLayerInner {
book: None,
image_type,
inner: RwLock::new(ImageLayerInner {
loaded: false,
file: None,
index_start_blk,
index_root_blk,
}),
};
trace!("created image layer {}", layer.path().display());
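The file layout produced by `ImageLayerWriter` (a header on block 0, values starting on block 1, and the index starting on the next block boundary) can be sketched with an in-memory buffer. This is a rough illustration, not the real writer: `blocks_for` and `write_toy_file` are hypothetical helpers, the header holds only `index_start_blk` instead of a full `Summary`, and the sketch assumes that the size used for the rounding includes the header block.

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

const PAGE_SZ: u64 = 8192;

/// Number of whole blocks needed to hold `bytes` bytes (ceiling division).
pub fn blocks_for(bytes: u64) -> u64 {
    (bytes + PAGE_SZ - 1) / PAGE_SZ
}

/// Writes a toy file and returns its bytes together with the block number
/// where the index starts.
pub fn write_toy_file(values: &[u8], index: &[u8]) -> std::io::Result<(Vec<u8>, u32)> {
    let mut file = Cursor::new(Vec::new());

    // Make room for the header block, then write the values.
    file.seek(SeekFrom::Start(PAGE_SZ))?;
    file.write_all(values)?;

    // The index starts on the first block boundary after the values.
    let index_start_blk = blocks_for(PAGE_SZ + values.len() as u64) as u32;
    file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ))?;
    file.write_all(index)?;

    // Finally go back and fill in the header, which records where the
    // index begins.
    file.seek(SeekFrom::Start(0))?;
    file.write_all(&index_start_blk.to_be_bytes())?;

    Ok((file.into_inner(), index_start_blk))
}
```

Writing the header last means the offsets it records are known by the time it is serialized, which is the same ordering `finish` follows above.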


@@ -1,30 +1,29 @@
//! An in-memory layer stores recently received PageVersions.
//! The page versions are held in a BTreeMap. To avoid OOM errors, the map size is limited
//! and layers can be spilled to disk into ephemeral files.
//! An in-memory layer stores recently received key-value pairs.
//!
//! And there's another BTreeMap to track the size of the relation.
//! The "in-memory" part of the name is a bit misleading: the actual page versions are
//! held in an ephemeral file, not in memory. The metadata for each page version, i.e.
//! its position in the file, is kept in memory, though.
//!
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter};
use crate::layered_repository::block_io::BlockReader;
use crate::layered_repository::delta_layer::{DeltaLayer, DeltaLayerWriter};
use crate::layered_repository::ephemeral_file::EphemeralFile;
use crate::layered_repository::filename::DeltaFileName;
use crate::layered_repository::image_layer::{ImageLayer, ImageLayerWriter};
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentBlk, SegmentTag,
RELISH_SEG_SIZE,
Layer, ValueReconstructResult, ValueReconstructState,
};
use crate::layered_repository::LayeredTimeline;
use crate::layered_repository::ZERO_PAGE;
use crate::repository::ZenithWalRecord;
use crate::repository::{Key, Value};
use crate::walrecord;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{ensure, Result};
use bytes::Bytes;
use log::*;
use anyhow::{bail, ensure, Result};
use std::collections::HashMap;
use std::io::Seek;
use std::os::unix::fs::FileExt;
use tracing::*;
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::ops::Range;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use std::sync::RwLock;
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
use zenith_utils::vec_map::VecMap;
@@ -33,7 +32,6 @@ pub struct InMemoryLayer {
conf: &'static PageServerConf,
tenantid: ZTenantId,
timelineid: ZTimelineId,
seg: SegmentTag,
///
/// This layer contains all the changes from 'start_lsn'. The
@@ -41,27 +39,9 @@ pub struct InMemoryLayer {
///
start_lsn: Lsn,
///
/// LSN of the oldest page version stored in this layer.
///
/// This is different from 'start_lsn' in that we enforce that the 'start_lsn'
/// of a layer always matches the 'end_lsn' of its predecessor, even if there
/// are no page versions until at a later LSN. That way you can detect any
/// missing layer files more easily. 'oldest_lsn' is the first page version
/// actually stored in this layer. In the range between 'start_lsn' and
/// 'oldest_lsn', there are no changes to the segment.
/// 'oldest_lsn' is used to adjust 'disk_consistent_lsn' and that is why it should
/// point to the beginning of WAL record. This is the other difference with 'start_lsn'
/// which points to end of WAL record. This is why 'oldest_lsn' can be smaller than 'start_lsn'.
///
oldest_lsn: Lsn,
/// The above fields never change. The parts that do change are in 'inner',
/// and protected by an RwLock.
inner: RwLock<InMemoryLayerInner>,
/// Predecessor layer might be needed?
incremental: bool,
}
pub struct InMemoryLayerInner {
@@ -69,98 +49,23 @@ pub struct InMemoryLayerInner {
/// Writes are only allowed when this is None
end_lsn: Option<Lsn>,
/// If this relation was dropped, remember when that happened.
/// The drop LSN is recorded in [`end_lsn`].
dropped: bool,
///
/// All versions of all pages in the layer are kept here. Indexed
/// by key and LSN. The value is an offset into the
/// ephemeral file where the page version is stored.
///
index: HashMap<Key, VecMap<Lsn, u64>>,
/// The PageVersion structs are stored in a serialized format in this file.
/// Each serialized PageVersion is preceded by a 'u32' length field.
/// 'page_versions' map stores offsets into this file.
/// The values are stored in a serialized format in this file.
/// Each serialized Value is preceded by a 'u32' length field.
/// The 'index' map stores offsets into this file.
file: EphemeralFile,
/// Metadata about all versions of all pages in the layer is kept
/// here. Indexed by block number and LSN. The value is an offset
/// into the ephemeral file where the page version is stored.
page_versions: HashMap<SegmentBlk, VecMap<Lsn, u64>>,
///
/// `seg_sizes` tracks the size of the segment at different points in time.
///
/// For a blocky rel, there is always one entry, at the layer's start_lsn,
/// so that determining the size never depends on the predecessor layer. For
/// a non-blocky rel, 'seg_sizes' is not used and is always empty.
///
seg_sizes: VecMap<Lsn, SegmentBlk>,
///
/// LSN of the newest page version stored in this layer.
///
/// The difference between 'end_lsn' and 'latest_lsn' is the same as between
/// 'start_lsn' and 'oldest_lsn'. See comments in 'oldest_lsn'.
///
latest_lsn: Lsn,
}
impl InMemoryLayerInner {
fn assert_writeable(&self) {
assert!(self.end_lsn.is_none());
}
fn get_seg_size(&self, lsn: Lsn) -> SegmentBlk {
// Scan the BTreeMap backwards, starting from the given entry.
let slice = self.seg_sizes.slice_range(..=lsn);
// We make sure there is always at least one entry
if let Some((_entry_lsn, entry)) = slice.last() {
*entry
} else {
panic!("could not find seg size in in-memory layer");
}
}
///
/// Read a page version from the ephemeral file.
///
fn read_pv(&self, off: u64) -> Result<PageVersion> {
let mut buf = Vec::new();
self.read_pv_bytes(off, &mut buf)?;
Ok(PageVersion::des(&buf)?)
}
///
/// Read a page version from the ephemeral file, as raw bytes, at
/// the given offset. The bytes are read into 'buf', which is
/// expanded if necessary. Returns the size of the page version.
///
fn read_pv_bytes(&self, off: u64, buf: &mut Vec<u8>) -> Result<usize> {
// read length
let mut lenbuf = [0u8; 4];
self.file.read_exact_at(&mut lenbuf, off)?;
let len = u32::from_ne_bytes(lenbuf) as usize;
if buf.len() < len {
buf.resize(len, 0);
}
self.file.read_exact_at(&mut buf[0..len], off + 4)?;
Ok(len)
}
fn write_pv(&mut self, pv: &PageVersion) -> Result<u64> {
// remember starting position
let pos = self.file.stream_position()?;
// make room for the 'length' field by writing zeros as a placeholder.
self.file.seek(std::io::SeekFrom::Start(pos + 4)).unwrap();
pv.ser_into(&mut self.file).unwrap();
// write the 'length' field.
let len = self.file.stream_position()? - pos - 4;
let lenbuf = u32::to_ne_bytes(len as u32);
self.file.write_all_at(&lenbuf, pos)?;
Ok(pos)
}
}
impl Layer for InMemoryLayer {
@@ -170,21 +75,12 @@ impl Layer for InMemoryLayer {
fn filename(&self) -> PathBuf {
let inner = self.inner.read().unwrap();
let end_lsn = if let Some(drop_lsn) = inner.end_lsn {
drop_lsn
} else {
Lsn(u64::MAX)
};
let end_lsn = inner.end_lsn.unwrap_or(Lsn(u64::MAX));
let delta_filename = DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn,
dropped: inner.dropped,
}
.to_string();
PathBuf::from(format!("inmem-{}", delta_filename))
PathBuf::from(format!(
"inmem-{:016X}-{:016X}",
self.start_lsn.0, end_lsn.0
))
}
fn get_tenant_id(&self) -> ZTenantId {
@@ -195,149 +91,90 @@ impl Layer for InMemoryLayer {
self.timelineid
}
fn get_seg_tag(&self) -> SegmentTag {
self.seg
fn get_key_range(&self) -> Range<Key> {
Key::MIN..Key::MAX
}
fn get_start_lsn(&self) -> Lsn {
self.start_lsn
}
fn get_end_lsn(&self) -> Lsn {
fn get_lsn_range(&self) -> Range<Lsn> {
let inner = self.inner.read().unwrap();
if let Some(end_lsn) = inner.end_lsn {
let end_lsn = if let Some(end_lsn) = inner.end_lsn {
end_lsn
} else {
Lsn(u64::MAX)
}
};
self.start_lsn..end_lsn
}
fn is_dropped(&self) -> bool {
let inner = self.inner.read().unwrap();
inner.dropped
}
/// Look up given page in the cache.
fn get_page_reconstruct_data(
/// Look up given value in the layer.
fn get_value_reconstruct_data(
&self,
blknum: SegmentBlk,
lsn: Lsn,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
key: Key,
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
) -> anyhow::Result<ValueReconstructResult> {
ensure!(lsn_range.start <= self.start_lsn);
let mut need_image = true;
assert!((0..RELISH_SEG_SIZE).contains(&blknum));
let inner = self.inner.read().unwrap();
{
let inner = self.inner.read().unwrap();
let mut reader = inner.file.block_cursor();
// Scan the page versions backwards, starting from `lsn`.
if let Some(vec_map) = inner.page_versions.get(&blknum) {
let slice = vec_map.slice_range(..=lsn);
for (entry_lsn, pos) in slice.iter().rev() {
match &reconstruct_data.page_img {
Some((cached_lsn, _)) if entry_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Complete)
}
_ => {}
// Scan the page versions backwards, starting from the end of `lsn_range`.
if let Some(vec_map) = inner.index.get(&key) {
let slice = vec_map.slice_range(lsn_range);
for (entry_lsn, pos) in slice.iter().rev() {
match &reconstruct_state.img {
Some((cached_lsn, _)) if entry_lsn <= cached_lsn => {
return Ok(ValueReconstructResult::Complete)
}
_ => {}
}
let pv = inner.read_pv(*pos)?;
match pv {
PageVersion::Page(img) => {
reconstruct_data.page_img = Some((*entry_lsn, img));
let buf = reader.read_blob(*pos)?;
let value = Value::des(&buf)?;
match value {
Value::Image(img) => {
reconstruct_state.img = Some((*entry_lsn, img));
return Ok(ValueReconstructResult::Complete);
}
Value::WalRecord(rec) => {
let will_init = rec.will_init();
reconstruct_state.records.push((*entry_lsn, rec));
if will_init {
// This WAL record initializes the page, so no need to go further back
need_image = false;
break;
}
PageVersion::Wal(rec) => {
reconstruct_data.records.push((*entry_lsn, rec.clone()));
if rec.will_init() {
// This WAL record initializes the page, so no need to go further back
need_image = false;
break;
}
}
}
}
}
// If we didn't find any records for this, check if the request is beyond EOF
if need_image
&& reconstruct_data.records.is_empty()
&& self.seg.rel.is_blocky()
&& blknum >= self.get_seg_size(lsn)?
{
return Ok(PageReconstructResult::Missing(self.start_lsn));
}
// release lock on 'inner'
}
// release lock on 'inner'
// If an older page image is needed to reconstruct the page, let the
// caller know
// caller know.
if need_image {
if self.incremental {
Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1)))
} else {
Ok(PageReconstructResult::Missing(self.start_lsn))
}
Ok(ValueReconstructResult::Continue)
} else {
Ok(PageReconstructResult::Complete)
Ok(ValueReconstructResult::Complete)
}
}
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<SegmentBlk> {
assert!(lsn >= self.start_lsn);
ensure!(
self.seg.rel.is_blocky(),
"get_seg_size() called on a non-blocky rel"
);
let inner = self.inner.read().unwrap();
Ok(inner.get_seg_size(lsn))
}
/// Does this segment exist at given LSN?
fn get_seg_exists(&self, lsn: Lsn) -> Result<bool> {
let inner = self.inner.read().unwrap();
// If the segment created after requested LSN,
// it doesn't exist in the layer. But we shouldn't
// have requested it in the first place.
assert!(lsn >= self.start_lsn);
// Is the requested LSN after the segment was dropped?
if inner.dropped {
if let Some(end_lsn) = inner.end_lsn {
if lsn >= end_lsn {
return Ok(false);
}
} else {
panic!("dropped in-memory layer with no end LSN");
}
}
// Otherwise, it exists
Ok(true)
}
/// Cannot unload anything in an in-memory layer, since there's no backing
/// store. To release memory used by an in-memory layer, use 'freeze' to turn
/// it into an on-disk layer.
fn unload(&self) -> Result<()> {
Ok(())
fn iter(&self) -> Box<dyn Iterator<Item = Result<(Key, Lsn, Value)>>> {
todo!();
}
/// Nothing to do here. When you drop the last reference to the layer, it will
/// be deallocated.
fn delete(&self) -> Result<()> {
panic!("can't delete an InMemoryLayer")
bail!("can't delete an InMemoryLayer")
}
fn is_incremental(&self) -> bool {
self.incremental
// in-memory layer is always considered incremental.
true
}
fn is_in_memory(&self) -> bool {
@@ -345,7 +182,7 @@ impl Layer for InMemoryLayer {
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
fn dump(&self, verbose: bool) -> Result<()> {
let inner = self.inner.read().unwrap();
let end_str = inner
@@ -355,29 +192,40 @@ impl Layer for InMemoryLayer {
.unwrap_or_default();
println!(
"----- in-memory layer for tli {} seg {} {}-{} {} ----",
self.timelineid, self.seg, self.start_lsn, end_str, inner.dropped,
"----- in-memory layer for tli {} LSNs {}-{} ----",
self.timelineid, self.start_lsn, end_str,
);
for (k, v) in inner.seg_sizes.as_slice() {
println!("seg_sizes {}: {}", k, v);
if !verbose {
return Ok(());
}
// List the blocks in order
let mut page_versions: Vec<(&SegmentBlk, &VecMap<Lsn, u64>)> =
inner.page_versions.iter().collect();
page_versions.sort_by_key(|k| k.0);
for (blknum, versions) in page_versions {
for (lsn, off) in versions.as_slice() {
let pv = inner.read_pv(*off);
let pv_description = match pv {
Ok(PageVersion::Page(_img)) => "page",
Ok(PageVersion::Wal(_rec)) => "wal",
Err(_err) => "INVALID",
};
println!("blk {} at {}: {}\n", blknum, lsn, pv_description);
let mut cursor = inner.file.block_cursor();
let mut buf = Vec::new();
for (key, vec_map) in inner.index.iter() {
for (lsn, pos) in vec_map.as_slice() {
let mut desc = String::new();
cursor.read_blob_into_buf(*pos, &mut buf)?;
let val = Value::des(&buf);
match val {
Ok(Value::Image(img)) => {
write!(&mut desc, " img {} bytes", img.len())?;
}
Ok(Value::WalRecord(rec)) => {
let wal_desc = walrecord::describe_wal_record(&rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)?;
}
Err(err) => {
write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?;
}
}
println!(" key {} at {}: {}", key, lsn, desc);
}
}
@@ -385,23 +233,7 @@ impl Layer for InMemoryLayer {
}
}
/// A result of an inmemory layer data being written to disk.
pub struct LayersOnDisk {
pub delta_layers: Vec<DeltaLayer>,
pub image_layers: Vec<ImageLayer>,
}
impl InMemoryLayer {
/// Return the oldest page version that's stored in this layer
pub fn get_oldest_lsn(&self) -> Lsn {
self.oldest_lsn
}
pub fn get_latest_lsn(&self) -> Lsn {
let inner = self.inner.read().unwrap();
inner.latest_lsn
}
///
/// Create a new, empty, in-memory layer
///
@@ -409,286 +241,77 @@ impl InMemoryLayer {
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
seg: SegmentTag,
start_lsn: Lsn,
oldest_lsn: Lsn,
) -> Result<InMemoryLayer> {
trace!(
"initializing new empty InMemoryLayer for writing {} on timeline {} at {}",
seg,
"initializing new empty InMemoryLayer for writing on timeline {} at {}",
timelineid,
start_lsn
);
// The segment is initially empty, so initialize 'seg_sizes' with 0.
let mut seg_sizes = VecMap::default();
if seg.rel.is_blocky() {
seg_sizes.append(start_lsn, 0).unwrap();
}
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
Ok(InMemoryLayer {
conf,
timelineid,
tenantid,
seg,
start_lsn,
oldest_lsn,
incremental: false,
inner: RwLock::new(InMemoryLayerInner {
end_lsn: None,
dropped: false,
index: HashMap::new(),
file,
page_versions: HashMap::new(),
seg_sizes,
latest_lsn: oldest_lsn,
}),
})
}
// Write operations
/// Remember new page version, as a WAL record over previous version
pub fn put_wal_record(
&self,
lsn: Lsn,
blknum: SegmentBlk,
rec: ZenithWalRecord,
) -> Result<u32> {
self.put_page_version(blknum, lsn, PageVersion::Wal(rec))
}
/// Remember new page version, as a full page image
pub fn put_page_image(&self, blknum: SegmentBlk, lsn: Lsn, img: Bytes) -> Result<u32> {
self.put_page_version(blknum, lsn, PageVersion::Page(img))
}
/// Common subroutine of the public put_wal_record() and put_page_image() functions.
/// Adds the page version to the in-memory tree
pub fn put_page_version(&self, blknum: SegmentBlk, lsn: Lsn, pv: PageVersion) -> Result<u32> {
assert!((0..RELISH_SEG_SIZE).contains(&blknum));
trace!(
"put_page_version blk {} of {} at {}/{}",
blknum,
self.seg.rel,
self.timelineid,
lsn
);
pub fn put_value(&self, key: Key, lsn: Lsn, val: Value) -> Result<()> {
trace!("put_value key {} at {}/{}", key, self.timelineid, lsn);
let mut inner = self.inner.write().unwrap();
inner.assert_writeable();
assert!(lsn >= inner.latest_lsn);
inner.latest_lsn = lsn;
// Write the page version to the file, and remember its offset in 'page_versions'
{
let off = inner.write_pv(&pv)?;
let vec_map = inner.page_versions.entry(blknum).or_default();
let old = vec_map.append_or_update_last(lsn, off).unwrap().0;
if old.is_some() {
// We already had an entry for this LSN. That's odd..
warn!(
"Page version of rel {} blk {} at {} already exists",
self.seg.rel, blknum, lsn
);
}
}
// Also update the relation size, if this extended the relation.
if self.seg.rel.is_blocky() {
let newsize = blknum + 1;
// use inner get_seg_size, since calling self.get_seg_size will try to acquire the lock,
// which we've just acquired above
let oldsize = inner.get_seg_size(lsn);
if newsize > oldsize {
trace!(
"enlarging segment {} from {} to {} blocks at {}",
self.seg,
oldsize,
newsize,
lsn
);
// If we are extending the relation by more than one page, initialize the "gap"
// with zeros
//
// XXX: What if the caller initializes the gap with subsequent call with same LSN?
// I don't think that can happen currently, but that is highly dependent on how
// PostgreSQL writes its WAL records and there's no guarantee of it. If it does
// happen, we would hit the "page version already exists" warning above on the
// subsequent call to initialize the gap page.
for gapblknum in oldsize..blknum {
let zeropv = PageVersion::Page(ZERO_PAGE.clone());
trace!(
"filling gap blk {} with zeros for write of {}",
gapblknum,
blknum
);
// Write the page version to the file, and remember its offset in
// 'page_versions'
{
let off = inner.write_pv(&zeropv)?;
let vec_map = inner.page_versions.entry(gapblknum).or_default();
let old = vec_map.append_or_update_last(lsn, off).unwrap().0;
if old.is_some() {
warn!(
"Page version of seg {} blk {} at {} already exists",
self.seg, gapblknum, lsn
);
}
}
}
inner.seg_sizes.append_or_update_last(lsn, newsize).unwrap();
return Ok(newsize - oldsize);
}
}
Ok(0)
}
/// Remember that the relation was truncated at given LSN
pub fn put_truncation(&self, lsn: Lsn, new_size: SegmentBlk) {
assert!(
self.seg.rel.is_blocky(),
"put_truncation() called on a non-blocky rel"
);
let mut inner = self.inner.write().unwrap();
inner.assert_writeable();
// check that we truncate to a smaller size than the segment had before the truncation
let old_size = inner.get_seg_size(lsn);
assert!(new_size < old_size);
let (old, _delta_size) = inner
.seg_sizes
.append_or_update_last(lsn, new_size)
.unwrap();
let off = inner.file.write_blob(&Value::ser(&val)?)?;
let vec_map = inner.index.entry(key).or_default();
let old = vec_map.append_or_update_last(lsn, off).unwrap().0;
if old.is_some() {
// We already had an entry for this LSN. That's odd..
warn!("Inserting truncation, but had an entry for the LSN already");
}
}
/// Remember that the segment was dropped at given LSN
pub fn drop_segment(&self, lsn: Lsn) {
let mut inner = self.inner.write().unwrap();
assert!(inner.end_lsn.is_none());
assert!(!inner.dropped);
inner.dropped = true;
assert!(self.start_lsn < lsn);
inner.end_lsn = Some(lsn);
trace!("dropped segment {} at {}", self.seg, lsn);
}
///
/// Initialize a new InMemoryLayer by copying the state at the given
/// point in time from given existing layer.
///
pub fn create_successor_layer(
conf: &'static PageServerConf,
src: Arc<dyn Layer>,
timelineid: ZTimelineId,
tenantid: ZTenantId,
start_lsn: Lsn,
oldest_lsn: Lsn,
) -> Result<InMemoryLayer> {
let seg = src.get_seg_tag();
assert!(oldest_lsn.is_aligned());
trace!(
"initializing new InMemoryLayer for writing {} on timeline {} at {}",
seg,
timelineid,
start_lsn,
);
// Copy the segment size at the start LSN from the predecessor layer.
let mut seg_sizes = VecMap::default();
if seg.rel.is_blocky() {
let size = src.get_seg_size(start_lsn)?;
seg_sizes.append(start_lsn, size).unwrap();
warn!("Key {} at {} already exists", key, lsn);
}
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
Ok(InMemoryLayer {
conf,
timelineid,
tenantid,
seg,
start_lsn,
oldest_lsn,
incremental: true,
inner: RwLock::new(InMemoryLayerInner {
end_lsn: None,
dropped: false,
file,
page_versions: HashMap::new(),
seg_sizes,
latest_lsn: oldest_lsn,
}),
})
Ok(())
}
pub fn is_writeable(&self) -> bool {
let inner = self.inner.read().unwrap();
inner.end_lsn.is_none()
pub fn put_tombstone(&self, _key_range: Range<Key>, _lsn: Lsn) -> Result<()> {
// TODO: Currently, we just leak the storage for any deleted keys
Ok(())
}
/// Make the layer non-writeable. Only call once.
/// Records the end_lsn for non-dropped layers.
/// `end_lsn` is inclusive
/// `end_lsn` is exclusive
pub fn freeze(&self, end_lsn: Lsn) {
let mut inner = self.inner.write().unwrap();
if inner.end_lsn.is_some() {
assert!(inner.dropped);
} else {
assert!(!inner.dropped);
assert!(self.start_lsn < end_lsn + 1);
inner.end_lsn = Some(Lsn(end_lsn.0 + 1));
assert!(self.start_lsn < end_lsn);
inner.end_lsn = Some(end_lsn);
if let Some((lsn, _)) = inner.seg_sizes.as_slice().last() {
assert!(lsn <= &end_lsn, "{:?} {:?}", lsn, end_lsn);
}
for (_blk, vec_map) in inner.page_versions.iter() {
for (lsn, _pos) in vec_map.as_slice() {
assert!(*lsn <= end_lsn);
}
for vec_map in inner.index.values() {
for (lsn, _pos) in vec_map.as_slice() {
assert!(*lsn < end_lsn);
}
}
}
/// Write the this frozen in-memory layer to disk.
/// Write this frozen in-memory layer to disk.
///
/// Returns new layers that replace this one.
/// If not dropped and reconstruct_pages is true, returns a new image layer containing the page versions
/// at the `end_lsn`. Can also return a DeltaLayer that includes all the
/// WAL records between start and end LSN. (The delta layer is not needed
/// when a new relish is created with a single LSN, so that the start and
/// end LSN are the same.)
pub fn write_to_disk(
&self,
timeline: &LayeredTimeline,
reconstruct_pages: bool,
) -> Result<LayersOnDisk> {
trace!(
"write_to_disk {} get_end_lsn is {}",
self.filename().display(),
self.get_end_lsn()
);
/// Returns a new delta layer with all the same data as this in-memory layer
pub fn write_to_disk(&self) -> Result<DeltaLayer> {
// Grab the lock in read-mode. We hold it over the I/O, but because this
// layer is not writeable anymore, no one should be trying to acquire the
// write lock on it, so we shouldn't block anyone. There's one exception
@@ -700,105 +323,32 @@ impl InMemoryLayer {
// rare though, so we just accept the potential latency hit for now.
let inner = self.inner.read().unwrap();
// Since `end_lsn` is exclusive, subtract 1 to calculate the last LSN
// that is included.
let end_lsn_exclusive = inner.end_lsn.unwrap();
let end_lsn_inclusive = Lsn(end_lsn_exclusive.0 - 1);
let mut delta_layer_writer = DeltaLayerWriter::new(
self.conf,
self.timelineid,
self.tenantid,
Key::MIN,
self.start_lsn..inner.end_lsn.unwrap(),
)?;
// Figure out if we should create a delta layer, image layer, or both.
let image_lsn: Option<Lsn>;
let delta_end_lsn: Option<Lsn>;
if self.is_dropped() || !reconstruct_pages {
// The segment was dropped. Create just a delta layer containing all the
// changes up to and including the drop.
delta_end_lsn = Some(end_lsn_exclusive);
image_lsn = None;
} else if self.start_lsn == end_lsn_inclusive {
// The layer contains exactly one LSN. It's enough to write an image
// layer at that LSN.
delta_end_lsn = None;
image_lsn = Some(end_lsn_inclusive);
} else {
// Create a delta layer with all the changes up to the end LSN,
// and an image layer at the end LSN.
//
// Note that the delta layer does *not* include the page versions
// at the end LSN. They are included in the image layer, and there's
// no need to store them twice.
delta_end_lsn = Some(end_lsn_inclusive);
image_lsn = Some(end_lsn_inclusive);
}
let mut buf = Vec::new();
let mut delta_layers = Vec::new();
let mut image_layers = Vec::new();
let mut cursor = inner.file.block_cursor();
if let Some(delta_end_lsn) = delta_end_lsn {
let mut delta_layer_writer = DeltaLayerWriter::new(
self.conf,
self.timelineid,
self.tenantid,
self.seg,
self.start_lsn,
delta_end_lsn,
self.is_dropped(),
)?;
let mut keys: Vec<(&Key, &VecMap<Lsn, u64>)> = inner.index.iter().collect();
keys.sort_by_key(|k| k.0);
// Write all page versions, in block + LSN order
let mut buf: Vec<u8> = Vec::new();
let pv_iter = inner.page_versions.iter();
let mut pages: Vec<(&SegmentBlk, &VecMap<Lsn, u64>)> = pv_iter.collect();
pages.sort_by_key(|(blknum, _vec_map)| *blknum);
for (blknum, vec_map) in pages {
for (lsn, pos) in vec_map.as_slice() {
if *lsn < delta_end_lsn {
let len = inner.read_pv_bytes(*pos, &mut buf)?;
delta_layer_writer.put_page_version(*blknum, *lsn, &buf[..len])?;
}
}
for (key, vec_map) in keys.iter() {
let key = **key;
// Write all page versions
for (lsn, pos) in vec_map.as_slice() {
cursor.read_blob_into_buf(*pos, &mut buf)?;
let val = Value::des(&buf)?;
delta_layer_writer.put_value(key, *lsn, val)?;
}
// Create seg_sizes
let seg_sizes = if delta_end_lsn == end_lsn_exclusive {
inner.seg_sizes.clone()
} else {
inner.seg_sizes.split_at(&end_lsn_exclusive).0
};
let delta_layer = delta_layer_writer.finish(seg_sizes)?;
delta_layers.push(delta_layer);
}
drop(inner);
// Write a new base image layer at the cutoff point
if let Some(image_lsn) = image_lsn {
let size = if self.seg.rel.is_blocky() {
self.get_seg_size(image_lsn)?
} else {
1
};
let mut image_layer_writer = ImageLayerWriter::new(
self.conf,
self.timelineid,
self.tenantid,
self.seg,
image_lsn,
size,
)?;
for blknum in 0..size {
let img = timeline.materialize_page(self.seg, blknum, image_lsn, &*self)?;
image_layer_writer.put_page_image(&img)?;
}
let image_layer = image_layer_writer.finish()?;
image_layers.push(image_layer);
}
Ok(LayersOnDisk {
delta_layers,
image_layers,
})
let delta_layer = delta_layer_writer.finish(Key::MAX)?;
Ok(delta_layer)
}
}
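
Reviewer note: the rewritten `put_value` / `freeze` pair above keeps one LSN-ordered list of file offsets per key, and `end_lsn` is now exclusive. The toy model below is only a sketch of that bookkeeping under simplifying assumptions: `ToyLayer`, the integer `Key` and `Lsn` aliases, and the fake offset counter are hypothetical stand-ins, not the pageserver's types. It also checks LSN ordering per key, whereas the real code asserts against a layer-wide `latest_lsn`.

```rust
use std::collections::HashMap;
use std::ops::Range;

// Hypothetical stand-ins for the real `Key` and `Lsn` types.
type Key = u64;
type Lsn = u64;

/// Toy model of an in-memory layer: for each key, an LSN-ordered list of
/// (lsn, offset) pairs, where the offset would point into the ephemeral file.
#[derive(Default)]
struct ToyLayer {
    start_lsn: Lsn,
    end_lsn: Option<Lsn>, // exclusive once the layer is frozen
    index: HashMap<Key, Vec<(Lsn, u64)>>,
    next_off: u64, // fake "file offset" counter
}

impl ToyLayer {
    fn new(start_lsn: Lsn) -> Self {
        ToyLayer {
            start_lsn,
            ..Default::default()
        }
    }

    /// Record a new version of `key` at `lsn`.
    /// Returns true if an entry at the same LSN already existed and was replaced.
    fn put_value(&mut self, key: Key, lsn: Lsn) -> bool {
        assert!(self.end_lsn.is_none(), "layer is frozen");
        let off = self.next_off;
        self.next_off += 1;

        let versions = self.index.entry(key).or_default();
        let last_lsn = versions.last().map(|(l, _)| *l);
        if last_lsn == Some(lsn) {
            // Same LSN as the last entry: update it in place.
            versions.last_mut().unwrap().1 = off;
            true
        } else {
            assert!(
                last_lsn.map_or(true, |l| l < lsn),
                "LSNs must not go backwards"
            );
            versions.push((lsn, off));
            false
        }
    }

    /// Freeze the layer. `end_lsn` is exclusive, so every stored LSN must be
    /// strictly less than it.
    fn freeze(&mut self, end_lsn: Lsn) {
        assert!(self.start_lsn < end_lsn);
        for versions in self.index.values() {
            for (lsn, _) in versions {
                assert!(*lsn < end_lsn);
            }
        }
        self.end_lsn = Some(end_lsn);
    }

    /// Half-open LSN range covered by a frozen layer.
    fn lsn_range(&self) -> Option<Range<Lsn>> {
        self.end_lsn.map(|end| self.start_lsn..end)
    }
}

fn main() {
    let mut layer = ToyLayer::new(100);
    layer.put_value(1, 100);
    layer.put_value(1, 110);
    // The last record is at LSN 110, so the exclusive end is 111.
    layer.freeze(111);
    println!("{:?}", layer.lsn_range());
}
```

With an exclusive bound, the frozen layer's range can be passed on as a plain `start..end` range, which is the shape the new `DeltaLayerWriter::new` call above takes, without the `+ 1` / `- 1` adjustments the old code needed.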

@@ -1,468 +0,0 @@
///
/// IntervalTree is a data structure for holding intervals. It is generic
/// to make unit testing possible, but the only real user of it is the layer map.
///
/// It's inspired by the "segment tree" or a "statistic tree" as described in
/// https://en.wikipedia.org/wiki/Segment_tree. However, we use a B-tree to hold
/// the points instead of a binary tree. This is called an "interval tree" instead
/// of "segment tree" because the term "segment" is already used in Zenith to mean
/// something else. To add to the confusion, there is another data structure known
/// as "interval tree" out there (see https://en.wikipedia.org/wiki/Interval_tree),
/// for storing intervals, but this isn't that.
///
/// The basic idea is to have a B-tree of "interesting Points". At each Point,
/// there is a list of intervals that contain the point. The Points are formed
/// from the start bounds of each interval; there is a Point for each distinct
/// start bound.
///
/// Operations:
///
/// To find intervals that contain a given point, you search the b-tree to find
/// the nearest Point <= search key. Then you just return the list of intervals.
///
/// To insert an interval, find the Point with start key equal to the inserted item.
/// If the Point doesn't exist yet, create it, by copying all the items from the
/// previous Point that cover the new Point. Then walk right, inserting the new
/// interval to all the Points that are contained by the new interval (including the
/// newly created Point).
///
/// To remove an interval, you scan the tree for all the Points that are contained by
/// the removed interval, and remove it from the list in each Point.
///
/// Requirements and assumptions:
///
/// - Can store overlapping items
/// - But there are not many overlapping items
/// - The interval bounds don't change after it is added to the tree
/// - Intervals are uniquely identified by pointer equality. You must not insert the
/// same interval object twice, and `remove` uses pointer equality to remove the right
/// interval. It is OK to have two intervals with the same bounds, however.
///
use std::collections::BTreeMap;
use std::fmt::Debug;
use std::ops::Range;
use std::sync::Arc;
pub struct IntervalTree<I: ?Sized>
where
I: IntervalItem,
{
points: BTreeMap<I::Key, Point<I>>,
}
struct Point<I: ?Sized> {
/// All intervals that contain this point, in no particular order.
///
/// We assume that there aren't a lot of overlapping intervals, so that this vector
/// never grows very large. If that assumption doesn't hold, we could keep this ordered
/// by the end bound, to speed up `search`. But as long as there are only a few elements,
/// a linear search is OK.
elements: Vec<Arc<I>>,
}
/// Abstraction for an interval that can be stored in the tree
///
/// The start bound is inclusive and the end bound is exclusive. End must be greater
/// than start.
pub trait IntervalItem {
type Key: Ord + Copy + Debug + Sized;
fn start_key(&self) -> Self::Key;
fn end_key(&self) -> Self::Key;
fn bounds(&self) -> Range<Self::Key> {
self.start_key()..self.end_key()
}
}
impl<I: ?Sized> IntervalTree<I>
where
I: IntervalItem,
{
/// Return an element that contains 'key', or precedes it.
///
/// If there are multiple candidates, returns the one with the highest 'end' key.
pub fn search(&self, key: I::Key) -> Option<Arc<I>> {
// Find the greatest point that precedes or is equal to the search key. If there is
// none, returns None.
let (_, p) = self.points.range(..=key).next_back()?;
// Find the element with the highest end key at this point
let highest_item = p
.elements
.iter()
.reduce(|a, b| {
// starting with Rust 1.53, could use `std::cmp::max_by_key` here
if a.end_key() > b.end_key() {
a
} else {
b
}
})
.unwrap();
Some(Arc::clone(highest_item))
}
/// Iterate over all items with start bound >= 'key'
pub fn iter_newer(&self, key: I::Key) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(key..),
elem_iter: None,
}
}
/// Iterate over all items
pub fn iter(&self) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(..),
elem_iter: None,
}
}
pub fn insert(&mut self, item: Arc<I>) {
let start_key = item.start_key();
let end_key = item.end_key();
assert!(start_key < end_key);
let bounds = start_key..end_key;
// Find the starting point and walk forward from there
let mut found_start_point = false;
let iter = self.points.range_mut(bounds);
for (point_key, point) in iter {
if *point_key == start_key {
found_start_point = true;
// It is an error to insert the same item to the tree twice.
assert!(
!point.elements.iter().any(|x| Arc::ptr_eq(x, &item)),
"interval is already in the tree"
);
}
point.elements.push(Arc::clone(&item));
}
if !found_start_point {
// Create a new Point for the starting point
// Look at the previous point, and copy over elements that overlap with this
// new point
let mut new_elements: Vec<Arc<I>> = Vec::new();
if let Some((_, prev_point)) = self.points.range(..start_key).next_back() {
let overlapping_prev_elements = prev_point
.elements
.iter()
.filter(|x| x.bounds().contains(&start_key))
.cloned();
new_elements.extend(overlapping_prev_elements);
}
new_elements.push(item);
let new_point = Point {
elements: new_elements,
};
self.points.insert(start_key, new_point);
}
}
pub fn remove(&mut self, item: &Arc<I>) {
// range search points
let start_key = item.start_key();
let end_key = item.end_key();
let bounds = start_key..end_key;
let mut points_to_remove: Vec<I::Key> = Vec::new();
let mut found_start_point = false;
for (point_key, point) in self.points.range_mut(bounds) {
if *point_key == start_key {
found_start_point = true;
}
let len_before = point.elements.len();
point.elements.retain(|other| !Arc::ptr_eq(other, item));
let len_after = point.elements.len();
assert_eq!(len_after + 1, len_before);
if len_after == 0 {
points_to_remove.push(*point_key);
}
}
assert!(found_start_point);
for k in points_to_remove {
self.points.remove(&k).unwrap();
}
}
}
pub struct IntervalIter<'a, I: ?Sized>
where
I: IntervalItem,
{
point_iter: std::collections::btree_map::Range<'a, I::Key, Point<I>>,
elem_iter: Option<(I::Key, std::slice::Iter<'a, Arc<I>>)>,
}
impl<'a, I> Iterator for IntervalIter<'a, I>
where
I: IntervalItem + ?Sized,
{
type Item = Arc<I>;
fn next(&mut self) -> Option<Self::Item> {
// Iterate over all elements in all the points in 'point_iter'. To avoid
// returning the same element twice, we only return each element at its
// starting point.
loop {
// Return next remaining element from the current point
if let Some((point_key, elem_iter)) = &mut self.elem_iter {
for elem in elem_iter {
if elem.start_key() == *point_key {
return Some(Arc::clone(elem));
}
}
}
// No more elements at this point. Move to next point.
if let Some((point_key, point)) = self.point_iter.next() {
self.elem_iter = Some((*point_key, point.elements.iter()));
continue;
} else {
// No more points, all done
return None;
}
}
}
}
impl<I: ?Sized> Default for IntervalTree<I>
where
I: IntervalItem,
{
fn default() -> Self {
IntervalTree {
points: BTreeMap::new(),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::fmt;
#[derive(Debug)]
struct MockItem {
start_key: u32,
end_key: u32,
val: String,
}
impl IntervalItem for MockItem {
type Key = u32;
fn start_key(&self) -> u32 {
self.start_key
}
fn end_key(&self) -> u32 {
self.end_key
}
}
impl MockItem {
fn new(start_key: u32, end_key: u32) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}", start_key, end_key),
}
}
fn new_str(start_key: u32, end_key: u32, val: &str) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}: {}", start_key, end_key, val),
}
}
}
impl fmt::Display for MockItem {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.val)
}
}
#[rustfmt::skip]
fn assert_search(
tree: &IntervalTree<MockItem>,
key: u32,
expected: &[&str],
) -> Option<Arc<MockItem>> {
if let Some(v) = tree.search(key) {
let vstr = v.to_string();
assert!(!expected.is_empty(), "search with {} returned {}, expected None", key, v);
assert!(
expected.contains(&vstr.as_str()),
"search with {} returned {}, expected one of: {:?}",
key, v, expected,
);
Some(v)
} else {
assert!(
expected.is_empty(),
"search with {} returned None, expected one of {:?}",
key, expected
);
None
}
}
fn assert_contents(tree: &IntervalTree<MockItem>, expected: &[&str]) {
let mut contents: Vec<String> = tree.iter().map(|e| e.to_string()).collect();
contents.sort();
assert_eq!(contents, expected);
}
fn dump_tree(tree: &IntervalTree<MockItem>) {
for (point_key, point) in tree.points.iter() {
print!("{}:", point_key);
for e in point.elements.iter() {
print!(" {}", e);
}
println!();
}
}
#[test]
fn test_interval_tree_simple() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Simple, non-overlapping ranges.
tree.insert(Arc::new(MockItem::new(10, 11)));
tree.insert(Arc::new(MockItem::new(11, 12)));
tree.insert(Arc::new(MockItem::new(12, 13)));
tree.insert(Arc::new(MockItem::new(18, 19)));
tree.insert(Arc::new(MockItem::new(17, 18)));
tree.insert(Arc::new(MockItem::new(15, 16)));
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &["10-11"]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["12-13"]);
assert_search(&tree, 13, &["12-13"]);
assert_search(&tree, 14, &["12-13"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 16, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["18-19"]);
assert_search(&tree, 19, &["18-19"]);
assert_search(&tree, 20, &["18-19"]);
// remove a few entries and search around them again
tree.remove(&assert_search(&tree, 10, &["10-11"]).unwrap()); // first entry
tree.remove(&assert_search(&tree, 12, &["12-13"]).unwrap()); // entry in the middle
tree.remove(&assert_search(&tree, 18, &["18-19"]).unwrap()); // last entry
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &[]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["11-12"]);
assert_search(&tree, 14, &["11-12"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["17-18"]);
}
#[test]
fn test_interval_tree_overlap() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Overlapping items
tree.insert(Arc::new(MockItem::new(22, 24)));
tree.insert(Arc::new(MockItem::new(23, 25)));
let x24_26 = Arc::new(MockItem::new(24, 26));
tree.insert(Arc::clone(&x24_26));
let x26_28 = Arc::new(MockItem::new(26, 28));
tree.insert(Arc::clone(&x26_28));
tree.insert(Arc::new(MockItem::new(25, 27)));
assert_search(&tree, 22, &["22-24"]);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25", "24-26"]);
assert_search(&tree, 25, &["24-26", "25-27"]);
assert_search(&tree, 26, &["25-27", "26-28"]);
assert_search(&tree, 27, &["26-28"]);
assert_search(&tree, 28, &["26-28"]);
assert_search(&tree, 29, &["26-28"]);
tree.remove(&x24_26);
tree.remove(&x26_28);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25"]);
assert_search(&tree, 25, &["25-27"]);
assert_search(&tree, 26, &["25-27"]);
assert_search(&tree, 27, &["25-27"]);
assert_search(&tree, 28, &["25-27"]);
assert_search(&tree, 29, &["25-27"]);
}
#[test]
fn test_interval_tree_nested() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Items containing other items
tree.insert(Arc::new(MockItem::new(31, 39)));
tree.insert(Arc::new(MockItem::new(32, 34)));
tree.insert(Arc::new(MockItem::new(33, 35)));
tree.insert(Arc::new(MockItem::new(30, 40)));
assert_search(&tree, 30, &["30-40"]);
assert_search(&tree, 31, &["30-40", "31-39"]);
assert_search(&tree, 32, &["30-40", "32-34", "31-39"]);
assert_search(&tree, 33, &["30-40", "32-34", "33-35", "31-39"]);
assert_search(&tree, 34, &["30-40", "33-35", "31-39"]);
assert_search(&tree, 35, &["30-40", "31-39"]);
assert_search(&tree, 36, &["30-40", "31-39"]);
assert_search(&tree, 37, &["30-40", "31-39"]);
assert_search(&tree, 38, &["30-40", "31-39"]);
assert_search(&tree, 39, &["30-40"]);
assert_search(&tree, 40, &["30-40"]);
assert_search(&tree, 41, &["30-40"]);
}
#[test]
fn test_interval_tree_duplicates() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Duplicate keys
let item_a = Arc::new(MockItem::new_str(55, 56, "a"));
tree.insert(Arc::clone(&item_a));
let item_b = Arc::new(MockItem::new_str(55, 56, "b"));
tree.insert(Arc::clone(&item_b));
let item_c = Arc::new(MockItem::new_str(55, 56, "c"));
tree.insert(Arc::clone(&item_c));
let item_d = Arc::new(MockItem::new_str(54, 56, "d"));
tree.insert(Arc::clone(&item_d));
let item_e = Arc::new(MockItem::new_str(55, 57, "e"));
tree.insert(Arc::clone(&item_e));
dump_tree(&tree);
assert_search(
&tree,
55,
&["55-56: a", "55-56: b", "55-56: c", "54-56: d", "55-57: e"],
);
tree.remove(&item_b);
dump_tree(&tree);
assert_contents(&tree, &["54-56: d", "55-56: a", "55-56: c", "55-57: e"]);
tree.remove(&item_d);
dump_tree(&tree);
assert_contents(&tree, &["55-56: a", "55-56: c", "55-57: e"]);
}
#[test]
#[should_panic]
fn test_interval_tree_insert_twice() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Inserting the same item twice is not cool
let item = Arc::new(MockItem::new(1, 2));
tree.insert(Arc::clone(&item));
tree.insert(Arc::clone(&item)); // fails assertion
}
}
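
Reviewer note: the least obvious property of the removed `search` is that it returns an interval that contains the key *or precedes it*, because it only looks at the nearest Point at or below the key. The sketch below reproduces just that lookup over a plain `BTreeMap`; the free function `search`, the helper `sample_points`, and the use of `Range<u32>` as the interval type are assumptions made for illustration, not the original API.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Find the nearest point <= `key`, then return the interval stored at that
/// point with the highest end bound. Returns None if no point precedes `key`.
fn search(points: &BTreeMap<u32, Vec<Range<u32>>>, key: u32) -> Option<Range<u32>> {
    let (_, intervals) = points.range(..=key).next_back()?;
    intervals.iter().max_by_key(|r| r.end).cloned()
}

/// Hand-built points, mirroring what inserts of 10..11, 12..13, 22..24 and
/// 23..25 would leave behind: each point lists the intervals that contain it.
fn sample_points() -> BTreeMap<u32, Vec<Range<u32>>> {
    let mut points = BTreeMap::new();
    points.insert(10, vec![10..11]);
    points.insert(12, vec![12..13]);
    points.insert(22, vec![22..24]);
    points.insert(23, vec![22..24, 23..25]);
    points
}

fn main() {
    let points = sample_points();
    for key in [9, 10, 13, 23] {
        println!("search({}) = {:?}", key, search(&points, key));
    }
}
```

Note that key 13 is not inside `12..13` (the end bound is exclusive), yet that interval is still returned as the nearest predecessor. This matches the `assert_search(&tree, 13, &["12-13"])` expectation in the tests above.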

@@ -1,32 +1,29 @@
//!
//! The layer map tracks what layers exist for all the relishes in a timeline.
//! The layer map tracks what layers exist in a timeline.
//!
//! When the timeline is first accessed, the server lists all layer files
//! in the timelines/<timelineid> directory, and populates this map with
//! ImageLayer and DeltaLayer structs corresponding to each file. When new WAL
//! is received, we create InMemoryLayers to hold the incoming records. Now and
//! then, in the checkpoint() function, the in-memory layers are frozen, forming
//! new image and delta layers and corresponding files are written to disk.
//! ImageLayer and DeltaLayer structs corresponding to each file. When the first
//! new WAL record is received, we create an InMemoryLayer to hold the incoming
//! records. Now and then, in the checkpoint() function, the in-memory layer is
//! frozen, and it is split up into new image and delta layers and the
//! corresponding files are written to disk.
//!
use crate::layered_repository::interval_tree::{IntervalItem, IntervalIter, IntervalTree};
use crate::layered_repository::storage_layer::{Layer, SegmentTag};
use crate::layered_repository::storage_layer::Layer;
use crate::layered_repository::storage_layer::{range_eq, range_overlaps};
use crate::layered_repository::InMemoryLayer;
use crate::relish::*;
use crate::repository::Key;
use anyhow::Result;
use lazy_static::lazy_static;
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};
use std::collections::VecDeque;
use std::ops::Range;
use std::sync::Arc;
use tracing::*;
use zenith_metrics::{register_int_gauge, IntGauge};
use zenith_utils::lsn::Lsn;
use super::global_layer_map::{LayerId, GLOBAL_LAYER_MAP};
lazy_static! {
static ref NUM_INMEMORY_LAYERS: IntGauge =
register_int_gauge!("pageserver_inmemory_layers", "Number of layers in memory")
.expect("failed to define a metric");
static ref NUM_ONDISK_LAYERS: IntGauge =
register_int_gauge!("pageserver_ondisk_layers", "Number of layers on-disk")
.expect("failed to define a metric");
@@ -37,98 +34,147 @@ lazy_static! {
///
#[derive(Default)]
pub struct LayerMap {
/// All the layers keyed by segment tag
segs: HashMap<SegmentTag, SegEntry>,
//
// 'open_layer' holds the current InMemoryLayer that is accepting new
// records. If it is None, 'next_open_layer_at' will be set instead, indicating
// the start LSN of the next InMemoryLayer that is to be created.
//
pub open_layer: Option<Arc<InMemoryLayer>>,
pub next_open_layer_at: Option<Lsn>,
/// All in-memory layers, ordered by 'oldest_lsn' and generation
/// of each layer. This allows easy access to the in-memory layer that
/// contains the oldest WAL record.
open_layers: BinaryHeap<OpenLayerEntry>,
///
/// The frozen layer, if any, contains WAL older than the current 'open_layer'
/// or 'next_open_layer_at', but newer than any historic layer. The frozen
/// layer is used during checkpointing, when an InMemoryLayer is being written out
/// to disk.
///
pub frozen_layers: VecDeque<Arc<InMemoryLayer>>,
/// Generation number, used to distinguish newly inserted entries in the
/// binary heap from older entries during checkpoint.
current_generation: u64,
/// All the historic layers are kept here
/// TODO: This is a placeholder implementation of a data structure
/// to hold information about all the layer files on disk and in
/// S3. Currently, it's just a vector and all operations perform a
/// linear scan over it. That obviously becomes slow as the
/// number of layers grows. I'm imagining that an R-tree or some
/// other 2D data structure would be the long-term solution here.
historic_layers: Vec<Arc<dyn Layer>>,
}
/// Return value of LayerMap::search
pub struct SearchResult {
pub layer: Arc<dyn Layer>,
pub lsn_floor: Lsn,
}
impl LayerMap {
///
/// Look up a layer using the given segment tag and LSN. This differs from a
/// plain key-value lookup in that if there is any layer that covers the
/// given LSN, or precedes the given LSN, it is returned. In other words,
/// you don't need to know the exact start LSN of the layer.
/// Find the latest layer that covers the given 'key', with lsn <
/// 'end_lsn'.
///
pub fn get(&self, tag: &SegmentTag, lsn: Lsn) -> Option<Arc<dyn Layer>> {
let segentry = self.segs.get(tag)?;
segentry.get(lsn)
}
/// Returns the layer, if any, and an 'lsn_floor' value that
/// indicates which portion of the layer the caller should
/// check. 'lsn_floor' is normally the start-LSN of the layer, but
/// can be greater if there is an overlapping layer that might
/// contain the version, even if it's missing from the returned
/// layer.
///
/// Get the open layer for given segment for writing. Or None if no open
/// layer exists.
///
pub fn get_open(&self, tag: &SegmentTag) -> Option<Arc<InMemoryLayer>> {
let segentry = self.segs.get(tag)?;
pub fn search(&self, key: Key, end_lsn: Lsn) -> Result<Option<SearchResult>> {
// linear search
// Find the latest image layer that covers the given key
let mut latest_img: Option<Arc<dyn Layer>> = None;
let mut latest_img_lsn: Option<Lsn> = None;
for l in self.historic_layers.iter() {
if l.is_incremental() {
continue;
}
if !l.get_key_range().contains(&key) {
continue;
}
let img_lsn = l.get_lsn_range().start;
segentry
.open_layer_id
.and_then(|layer_id| GLOBAL_LAYER_MAP.read().unwrap().get(&layer_id))
}
if img_lsn >= end_lsn {
// too new
continue;
}
if Lsn(img_lsn.0 + 1) == end_lsn {
// found exact match
return Ok(Some(SearchResult {
layer: Arc::clone(l),
lsn_floor: img_lsn,
}));
}
if img_lsn > latest_img_lsn.unwrap_or(Lsn(0)) {
latest_img = Some(Arc::clone(l));
latest_img_lsn = Some(img_lsn);
}
}
///
/// Insert an open in-memory layer
///
pub fn insert_open(&mut self, layer: Arc<InMemoryLayer>) {
let segentry = self.segs.entry(layer.get_seg_tag()).or_default();
let layer_id = segentry.update_open(Arc::clone(&layer));
let oldest_lsn = layer.get_oldest_lsn();
// After a crash and restart, 'oldest_lsn' of the oldest in-memory
// layer becomes the WAL streaming starting point, so it better not point
// in the middle of a WAL record.
assert!(oldest_lsn.is_aligned());
// Also add it to the binary heap
let open_layer_entry = OpenLayerEntry {
oldest_lsn: layer.get_oldest_lsn(),
layer_id,
generation: self.current_generation,
};
self.open_layers.push(open_layer_entry);
NUM_INMEMORY_LAYERS.inc();
}
/// Remove an open in-memory layer
pub fn remove_open(&mut self, layer_id: LayerId) {
// Note: we don't try to remove the entry from the binary heap.
// It will be removed lazily by peek_oldest_open() when it's made it to
// the top of the heap.
let layer_opt = {
let mut global_map = GLOBAL_LAYER_MAP.write().unwrap();
let layer_opt = global_map.get(&layer_id);
global_map.remove(&layer_id);
// TODO it's bad that a ref can still exist after being evicted from cache
layer_opt
};
if let Some(layer) = layer_opt {
let mut segentry = self.segs.get_mut(&layer.get_seg_tag()).unwrap();
if segentry.open_layer_id == Some(layer_id) {
// Also remove it from the SegEntry of this segment
segentry.open_layer_id = None;
} else {
// We could have already updated segentry.open for
// dropped (non-writeable) layer. This is fine.
assert!(!layer.is_writeable());
assert!(layer.is_dropped());
// Search the delta layers
let mut latest_delta: Option<Arc<dyn Layer>> = None;
for l in self.historic_layers.iter() {
if !l.is_incremental() {
continue;
}
if !l.get_key_range().contains(&key) {
continue;
}
NUM_INMEMORY_LAYERS.dec();
if l.get_lsn_range().start >= end_lsn {
// too new
continue;
}
if l.get_lsn_range().end >= end_lsn {
// this layer contains the requested point in the key/lsn space.
// No need to search any further
trace!(
"found layer {} for request on {} at {}",
l.filename().display(),
key,
end_lsn
);
latest_delta.replace(Arc::clone(l));
break;
}
// this layer's end LSN is smaller than the requested point. If there's
// nothing newer, this is what we need to return. Remember this.
if let Some(ref old_candidate) = latest_delta {
if l.get_lsn_range().end > old_candidate.get_lsn_range().end {
latest_delta.replace(Arc::clone(l));
}
} else {
latest_delta.replace(Arc::clone(l));
}
}
if let Some(l) = latest_delta {
trace!(
"found (old) layer {} for request on {} at {}",
l.filename().display(),
key,
end_lsn
);
let lsn_floor = std::cmp::max(
Lsn(latest_img_lsn.unwrap_or(Lsn(0)).0 + 1),
l.get_lsn_range().start,
);
Ok(Some(SearchResult {
lsn_floor,
layer: l,
}))
} else if let Some(l) = latest_img {
trace!(
"found img layer and no deltas for request on {} at {}",
key,
end_lsn
);
Ok(Some(SearchResult {
lsn_floor: latest_img_lsn.unwrap(),
layer: l,
}))
} else {
trace!("no layer found for request on {} at {}", key, end_lsn);
Ok(None)
}
}
@@ -136,9 +182,7 @@ impl LayerMap {
/// Insert an on-disk layer
///
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
let segentry = self.segs.entry(layer.get_seg_tag()).or_default();
segentry.insert_historic(layer);
self.historic_layers.push(layer);
NUM_ONDISK_LAYERS.inc();
}
@@ -147,348 +191,207 @@ impl LayerMap {
///
/// This should be called when the corresponding file on disk has been deleted.
///
#[allow(dead_code)]
pub fn remove_historic(&mut self, layer: Arc<dyn Layer>) {
let tag = layer.get_seg_tag();
let len_before = self.historic_layers.len();
if let Some(segentry) = self.segs.get_mut(&tag) {
segentry.historic.remove(&layer);
}
// FIXME: ptr_eq might fail to return true for 'dyn'
// references. Clippy complains about this. In practice it
// seems to work, the assertion below would be triggered
// otherwise but this ought to be fixed.
#[allow(clippy::vtable_address_comparisons)]
self.historic_layers
.retain(|other| !Arc::ptr_eq(other, &layer));
assert_eq!(self.historic_layers.len(), len_before - 1);
NUM_ONDISK_LAYERS.dec();
}
// List relations along with a flag that marks if they exist at the given lsn.
// spcnode 0 and dbnode 0 have special meanings and mean all tablespaces/databases.
// Pass Tag if we're only interested in some relations.
pub fn list_relishes(&self, tag: Option<RelTag>, lsn: Lsn) -> Result<HashMap<RelishTag, bool>> {
let mut rels: HashMap<RelishTag, bool> = HashMap::new();
for (seg, segentry) in self.segs.iter() {
match seg.rel {
RelishTag::Relation(reltag) => {
if let Some(request_rel) = tag {
if (request_rel.spcnode == 0 || reltag.spcnode == request_rel.spcnode)
&& (request_rel.dbnode == 0 || reltag.dbnode == request_rel.dbnode)
{
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
}
}
_ => {
if tag == None {
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
}
}
}
Ok(rels)
}
/// Is there a newer image layer for given segment?
/// Is there a newer image layer for given key-range?
///
/// This is used for garbage collection, to determine if an old layer can
/// be deleted.
/// We ignore segments newer than disk_consistent_lsn because they will be removed at restart
/// We ignore layers newer than disk_consistent_lsn because they will be removed at restart
/// We also only look at historic layers
//#[allow(dead_code)]
pub fn newer_image_layer_exists(
&self,
seg: SegmentTag,
key_range: &Range<Key>,
lsn: Lsn,
disk_consistent_lsn: Lsn,
) -> bool {
if let Some(segentry) = self.segs.get(&seg) {
segentry.newer_image_layer_exists(lsn, disk_consistent_lsn)
} else {
false
}
}
) -> Result<bool> {
let mut range_remain = key_range.clone();
/// Is there any layer for given segment that is alive at the lsn?
///
/// This is a public wrapper for SegEntry function,
/// used for garbage collection, to determine if some alive layer
/// exists at the lsn. If so, we shouldn't delete a newer dropped layer
/// to avoid incorrectly making it visible.
pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result<bool> {
Ok(if let Some(segentry) = self.segs.get(&seg) {
segentry.exists_at_lsn(lsn)?.unwrap_or(false)
} else {
false
})
}
loop {
let mut made_progress = false;
for l in self.historic_layers.iter() {
if l.is_incremental() {
continue;
}
let img_lsn = l.get_lsn_range().start;
if !l.is_incremental()
&& l.get_key_range().contains(&range_remain.start)
&& img_lsn > lsn
&& img_lsn < disk_consistent_lsn
{
made_progress = true;
let img_key_end = l.get_key_range().end;
/// Return the oldest in-memory layer, along with its generation number.
pub fn peek_oldest_open(&mut self) -> Option<(LayerId, Arc<InMemoryLayer>, u64)> {
let global_map = GLOBAL_LAYER_MAP.read().unwrap();
if img_key_end >= range_remain.end {
return Ok(true);
}
range_remain.start = img_key_end;
}
}
while let Some(oldest_entry) = self.open_layers.peek() {
if let Some(layer) = global_map.get(&oldest_entry.layer_id) {
return Some((oldest_entry.layer_id, layer, oldest_entry.generation));
} else {
self.open_layers.pop();
if !made_progress {
return Ok(false);
}
}
None
}
/// Increment the generation number used to stamp open in-memory layers. Layers
/// added with `insert_open` after this call will be associated with the new
/// generation. Returns the new generation number.
pub fn increment_generation(&mut self) -> u64 {
self.current_generation += 1;
self.current_generation
pub fn iter_historic_layers(&self) -> std::slice::Iter<Arc<dyn Layer>> {
self.historic_layers.iter()
}
pub fn iter_historic_layers(&self) -> HistoricLayerIter {
HistoricLayerIter {
seg_iter: self.segs.iter(),
iter: None,
/// Find the last image layer that covers 'key', ignoring any image layers
/// newer than 'lsn'.
fn find_latest_image(&self, key: Key, lsn: Lsn) -> Option<Arc<dyn Layer>> {
let mut candidate_lsn = Lsn(0);
let mut candidate = None;
for l in self.historic_layers.iter() {
if l.is_incremental() {
continue;
}
if !l.get_key_range().contains(&key) {
continue;
}
let this_lsn = l.get_lsn_range().start;
if this_lsn > lsn {
continue;
}
if this_lsn < candidate_lsn {
// our previous candidate was better
continue;
}
candidate_lsn = this_lsn;
candidate = Some(Arc::clone(l));
}
candidate
}
///
/// Divide the whole given range of keys into sub-ranges based on the latest
/// image layer that covers each range. (This is used when creating new
/// image layers)
///
// FIXME: clippy complains that the result type is very complex. She's probably
// right...
#[allow(clippy::type_complexity)]
pub fn image_coverage(
&self,
key_range: &Range<Key>,
lsn: Lsn,
) -> Result<Vec<(Range<Key>, Option<Arc<dyn Layer>>)>> {
let mut points = vec![key_range.start];
for l in self.historic_layers.iter() {
if l.get_lsn_range().start > lsn {
continue;
}
let range = l.get_key_range();
if key_range.contains(&range.start) {
points.push(l.get_key_range().start);
}
if key_range.contains(&range.end) {
points.push(l.get_key_range().end);
}
}
points.push(key_range.end);
points.sort();
points.dedup();
// Ok, we now have a list of "interesting" points in the key space
// For each range between the points, find the latest image
let mut start = *points.first().unwrap();
let mut ranges = Vec::new();
for end in points[1..].iter() {
let img = self.find_latest_image(start, lsn);
ranges.push((start..*end, img));
start = *end;
}
Ok(ranges)
}
/// Count how many L1 delta layers there are that overlap with the
/// given key and LSN range.
pub fn count_deltas(&self, key_range: &Range<Key>, lsn_range: &Range<Lsn>) -> Result<usize> {
let mut result = 0;
for l in self.historic_layers.iter() {
if !l.is_incremental() {
continue;
}
if !range_overlaps(&l.get_lsn_range(), lsn_range) {
continue;
}
if !range_overlaps(&l.get_key_range(), key_range) {
continue;
}
// We ignore level0 delta layers. Unless the whole keyspace fits
// into one partition
if !range_eq(key_range, &(Key::MIN..Key::MAX))
&& range_eq(&l.get_key_range(), &(Key::MIN..Key::MAX))
{
continue;
}
result += 1;
}
Ok(result)
}
/// Return all L0 delta layers
pub fn get_level0_deltas(&self) -> Result<Vec<Arc<dyn Layer>>> {
let mut deltas = Vec::new();
for l in self.historic_layers.iter() {
if !l.is_incremental() {
continue;
}
if l.get_key_range() != (Key::MIN..Key::MAX) {
continue;
}
deltas.push(Arc::clone(l));
}
Ok(deltas)
}
/// debugging function to print out the contents of the layer map
#[allow(unused)]
pub fn dump(&self) -> Result<()> {
pub fn dump(&self, verbose: bool) -> Result<()> {
println!("Begin dump LayerMap");
for (seg, segentry) in self.segs.iter() {
if let Some(open) = &segentry.open_layer_id {
if let Some(layer) = GLOBAL_LAYER_MAP.read().unwrap().get(open) {
layer.dump()?;
} else {
println!("layer not found in global map");
}
}
for layer in segentry.historic.iter() {
layer.dump()?;
}
println!("open_layer:");
if let Some(open_layer) = &self.open_layer {
open_layer.dump(verbose)?;
}
println!("frozen_layers:");
for frozen_layer in self.frozen_layers.iter() {
frozen_layer.dump(verbose)?;
}
println!("historic_layers:");
for layer in self.historic_layers.iter() {
layer.dump(verbose)?;
}
println!("End dump LayerMap");
Ok(())
}
}
impl IntervalItem for dyn Layer {
type Key = Lsn;
fn start_key(&self) -> Lsn {
self.get_start_lsn()
}
fn end_key(&self) -> Lsn {
self.get_end_lsn()
}
}
///
/// Per-segment entry in the LayerMap::segs hash map. Holds all the layers
/// associated with the segment.
///
/// The last layer that is open for writes is always an InMemoryLayer,
/// and is kept in a separate field, because there can be only one for
/// each segment. The older layers, stored on disk, are kept in an
/// IntervalTree.
#[derive(Default)]
struct SegEntry {
open_layer_id: Option<LayerId>,
historic: IntervalTree<dyn Layer>,
}
impl SegEntry {
/// Does the segment exist at given LSN?
/// Return None if object is not found in this SegEntry.
fn exists_at_lsn(&self, lsn: Lsn) -> Result<Option<bool>> {
if let Some(layer) = self.get(lsn) {
Ok(Some(layer.get_seg_exists(lsn)?))
} else {
Ok(None)
}
}
pub fn get(&self, lsn: Lsn) -> Option<Arc<dyn Layer>> {
if let Some(open_layer_id) = &self.open_layer_id {
let open_layer = GLOBAL_LAYER_MAP.read().unwrap().get(open_layer_id)?;
if open_layer.get_start_lsn() <= lsn {
return Some(open_layer);
}
}
self.historic.search(lsn)
}
pub fn newer_image_layer_exists(&self, lsn: Lsn, disk_consistent_lsn: Lsn) -> bool {
// We only check on-disk layers, because
// in-memory layers are not durable
// The end-LSN is exclusive, while disk_consistent_lsn is
// inclusive. For example, if disk_consistent_lsn is 100, it is
// OK for a delta layer to have end LSN 101, but if the end LSN
// is 102, then it might not have been fully flushed to disk
// before crash.
self.historic
.iter_newer(lsn)
.any(|layer| !layer.is_incremental() && layer.get_end_lsn() <= disk_consistent_lsn + 1)
}
// Set new open layer for a SegEntry.
// It's ok to rewrite previous open layer,
// but only if it is not writeable anymore.
pub fn update_open(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
if let Some(prev_open_layer_id) = &self.open_layer_id {
if let Some(prev_open_layer) = GLOBAL_LAYER_MAP.read().unwrap().get(prev_open_layer_id)
{
assert!(!prev_open_layer.is_writeable());
}
}
let open_layer_id = GLOBAL_LAYER_MAP.write().unwrap().insert(layer);
self.open_layer_id = Some(open_layer_id);
open_layer_id
}
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
self.historic.insert(layer);
}
}
/// Entry held in LayerMap::open_layers, with boilerplate comparison routines
/// to implement a min-heap ordered by 'oldest_lsn' and 'generation'
///
/// The generation number associated with each entry can be used to distinguish
/// recently-added entries (i.e after last call to increment_generation()) from older
/// entries with the same 'oldest_lsn'.
struct OpenLayerEntry {
oldest_lsn: Lsn, // copy of layer.get_oldest_lsn()
generation: u64,
layer_id: LayerId,
}
impl Ord for OpenLayerEntry {
fn cmp(&self, other: &Self) -> Ordering {
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that. Entries with identical oldest_lsn are ordered by generation
other
.oldest_lsn
.cmp(&self.oldest_lsn)
.then_with(|| other.generation.cmp(&self.generation))
}
}
impl PartialOrd for OpenLayerEntry {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for OpenLayerEntry {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
}
}
impl Eq for OpenLayerEntry {}
/// Iterator returned by LayerMap::iter_historic_layers()
pub struct HistoricLayerIter<'a> {
seg_iter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<IntervalIter<'a, dyn Layer>>,
}
impl<'a> Iterator for HistoricLayerIter<'a> {
type Item = Arc<dyn Layer>;
fn next(&mut self) -> std::option::Option<<Self as std::iter::Iterator>::Item> {
loop {
if let Some(x) = &mut self.iter {
if let Some(x) = x.next() {
return Some(Arc::clone(&x));
}
}
if let Some((_tag, segentry)) = self.seg_iter.next() {
self.iter = Some(segentry.historic.iter());
continue;
} else {
return None;
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::PageServerConf;
use std::str::FromStr;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
/// Arbitrary relation tag, for testing.
const TESTREL_A: RelishTag = RelishTag::Relation(RelTag {
spcnode: 0,
dbnode: 111,
relnode: 1000,
forknum: 0,
});
lazy_static! {
static ref DUMMY_TIMELINEID: ZTimelineId =
ZTimelineId::from_str("00000000000000000000000000000000").unwrap();
static ref DUMMY_TENANTID: ZTenantId =
ZTenantId::from_str("00000000000000000000000000000000").unwrap();
}
/// Construct a dummy InMemoryLayer for testing
fn dummy_inmem_layer(
conf: &'static PageServerConf,
segno: u32,
start_lsn: Lsn,
oldest_lsn: Lsn,
) -> Arc<InMemoryLayer> {
Arc::new(
InMemoryLayer::create(
conf,
*DUMMY_TIMELINEID,
*DUMMY_TENANTID,
SegmentTag {
rel: TESTREL_A,
segno,
},
start_lsn,
oldest_lsn,
)
.unwrap(),
)
}
#[test]
fn test_open_layers() -> Result<()> {
let conf = PageServerConf::dummy_conf(PageServerConf::test_repo_dir("dummy_inmem_layer"));
let conf = Box::leak(Box::new(conf));
std::fs::create_dir_all(conf.timeline_path(&DUMMY_TIMELINEID, &DUMMY_TENANTID))?;
let mut layers = LayerMap::default();
let gen1 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 0, Lsn(0x100), Lsn(0x100)));
layers.insert_open(dummy_inmem_layer(conf, 1, Lsn(0x100), Lsn(0x200)));
layers.insert_open(dummy_inmem_layer(conf, 2, Lsn(0x100), Lsn(0x120)));
layers.insert_open(dummy_inmem_layer(conf, 3, Lsn(0x100), Lsn(0x110)));
let gen2 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 4, Lsn(0x100), Lsn(0x110)));
layers.insert_open(dummy_inmem_layer(conf, 5, Lsn(0x100), Lsn(0x100)));
// A helper function (closure) to pop the next oldest open entry from the layer map,
// and assert that it is what we'd expect
let mut assert_pop_layer = |expected_segno: u32, expected_generation: u64| {
let (layer_id, l, generation) = layers.peek_oldest_open().unwrap();
assert!(l.get_seg_tag().segno == expected_segno);
assert!(generation == expected_generation);
layers.remove_open(layer_id);
};
assert_pop_layer(0, gen1); // 0x100
assert_pop_layer(5, gen2); // 0x100
assert_pop_layer(3, gen1); // 0x110
assert_pop_layer(4, gen2); // 0x110
assert_pop_layer(2, gen1); // 0x120
assert_pop_layer(1, gen1); // 0x200
Ok(())
}
}
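
The image_coverage() function in the layer map diff above splits a key range at every layer boundary and then looks up the latest image for each piece. A minimal sketch of that idea follows, under simplifying assumptions: keys are plain u32 values instead of Key, LSNs are plain u64 values instead of Lsn, and the hypothetical Image struct stands in for an image layer. Unlike the code in the diff, which collects split points from all historic layers, this sketch only looks at images.

```rust
use std::ops::Range;

/// Hypothetical stand-in for an image layer: a key range plus the snapshot LSN.
#[derive(Debug, Clone, PartialEq)]
struct Image {
    keys: Range<u32>,
    lsn: u64,
}

/// Latest image that contains `key`, ignoring images newer than `lsn`.
fn find_latest_image(images: &[Image], key: u32, lsn: u64) -> Option<&Image> {
    images
        .iter()
        .filter(|img| img.keys.contains(&key) && img.lsn <= lsn)
        .max_by_key(|img| img.lsn)
}

/// Split `key_range` at every image boundary that falls inside it, and report
/// the LSN of the latest image covering each sub-range (None if uncovered).
fn image_coverage(
    images: &[Image],
    key_range: &Range<u32>,
    lsn: u64,
) -> Vec<(Range<u32>, Option<u64>)> {
    // Collect the "interesting" points in the key space.
    let mut points = vec![key_range.start];
    for img in images.iter().filter(|img| img.lsn <= lsn) {
        if key_range.contains(&img.keys.start) {
            points.push(img.keys.start);
        }
        if key_range.contains(&img.keys.end) {
            points.push(img.keys.end);
        }
    }
    points.push(key_range.end);
    points.sort();
    points.dedup();

    // Between two adjacent points the set of covering images cannot change,
    // so probing the start key of each sub-range is enough.
    points
        .windows(2)
        .map(|w| {
            let latest = find_latest_image(images, w[0], lsn).map(|img| img.lsn);
            (w[0]..w[1], latest)
        })
        .collect()
}
```

Probing only the first key of each sub-range works because the split points include every boundary inside the requested range, so no image can start or end in the middle of a sub-range.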

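
The new newer_image_layer_exists() in the diff above decides whether a whole key range is covered by image layers by repeatedly advancing the start of the remaining range. The sketch below isolates that loop. It assumes the caller has already filtered the layers by LSN (newer than 'lsn', older than 'disk_consistent_lsn') and passes only the key ranges; covered_by_images and the u32 keys are illustrative names, not part of the pageserver code.

```rust
use std::ops::Range;

/// Returns true if the union of `image_key_ranges` covers all of `key_range`.
/// The ranges may be given in any order and may overlap.
fn covered_by_images(image_key_ranges: &[Range<u32>], key_range: &Range<u32>) -> bool {
    let mut remain = key_range.clone();
    loop {
        let mut made_progress = false;
        for r in image_key_ranges {
            if r.contains(&remain.start) {
                made_progress = true;
                if r.end >= remain.end {
                    // This image reaches the end of what is left to cover.
                    return true;
                }
                // Covered up to r.end; keep looking from there.
                remain.start = r.end;
            }
        }
        if !made_progress {
            // No image covers the first remaining key: there is a gap.
            return false;
        }
    }
}
```

Each successful step strictly increases the start of the remaining range, so the loop terminates. The outer loop is needed because a range that extends the coverage may appear earlier in the slice than the range that makes it reachable.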

@@ -6,9 +6,10 @@
//!
//! The module contains all structs and related helper methods related to timeline metadata.
use std::{convert::TryInto, path::PathBuf};
use std::path::PathBuf;
use anyhow::ensure;
use serde::{Deserialize, Serialize};
use zenith_utils::{
bin_ser::BeSer,
lsn::Lsn,
@@ -16,11 +17,13 @@ use zenith_utils::{
};
use crate::config::PageServerConf;
use crate::STORAGE_FORMAT_VERSION;
// Taken from PG_CONTROL_MAX_SAFE_SIZE
const METADATA_MAX_SAFE_SIZE: usize = 512;
const METADATA_CHECKSUM_SIZE: usize = std::mem::size_of::<u32>();
const METADATA_MAX_DATA_SIZE: usize = METADATA_MAX_SAFE_SIZE - METADATA_CHECKSUM_SIZE;
/// We assume that a write of up to METADATA_MAX_SIZE bytes is atomic.
///
/// This is the same assumption that PostgreSQL makes with the control file,
/// see PG_CONTROL_MAX_SAFE_SIZE
const METADATA_MAX_SIZE: usize = 512;
/// The name of the metadata file pageserver creates per timeline.
pub const METADATA_FILE_NAME: &str = "metadata";
@@ -28,8 +31,22 @@ pub const METADATA_FILE_NAME: &str = "metadata";
/// Metadata stored on disk for each timeline
///
/// The fields correspond to the values we hold in memory, in LayeredTimeline.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimelineMetadata {
hdr: TimelineMetadataHeader,
body: TimelineMetadataBody,
}
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
struct TimelineMetadataHeader {
checksum: u32, // CRC of serialized metadata body
size: u16, // size of serialized metadata
format_version: u16, // storage format version (used for compatibility checks)
}
const METADATA_HDR_SIZE: usize = std::mem::size_of::<TimelineMetadataHeader>();
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
struct TimelineMetadataBody {
disk_consistent_lsn: Lsn,
// This is only set if we know it. We track it in memory when the page
// server is running, but we only track the value corresponding to
@@ -69,130 +86,90 @@ impl TimelineMetadata {
initdb_lsn: Lsn,
) -> Self {
Self {
disk_consistent_lsn,
prev_record_lsn,
ancestor_timeline,
ancestor_lsn,
latest_gc_cutoff_lsn,
initdb_lsn,
hdr: TimelineMetadataHeader {
checksum: 0,
size: 0,
format_version: STORAGE_FORMAT_VERSION,
},
body: TimelineMetadataBody {
disk_consistent_lsn,
prev_record_lsn,
ancestor_timeline,
ancestor_lsn,
latest_gc_cutoff_lsn,
initdb_lsn,
},
}
}
pub fn from_bytes(metadata_bytes: &[u8]) -> anyhow::Result<Self> {
ensure!(
metadata_bytes.len() == METADATA_MAX_SAFE_SIZE,
metadata_bytes.len() == METADATA_MAX_SIZE,
"metadata bytes size is wrong"
);
let data = &metadata_bytes[..METADATA_MAX_DATA_SIZE];
let calculated_checksum = crc32c::crc32c(data);
let checksum_bytes: &[u8; METADATA_CHECKSUM_SIZE] =
metadata_bytes[METADATA_MAX_DATA_SIZE..].try_into()?;
let expected_checksum = u32::from_le_bytes(*checksum_bytes);
let hdr = TimelineMetadataHeader::des(&metadata_bytes[0..METADATA_HDR_SIZE])?;
ensure!(
calculated_checksum == expected_checksum,
hdr.format_version == STORAGE_FORMAT_VERSION,
"format version mismatch"
);
let metadata_size = hdr.size as usize;
ensure!(
metadata_size <= METADATA_MAX_SIZE,
"corrupted metadata file"
);
let calculated_checksum = crc32c::crc32c(&metadata_bytes[METADATA_HDR_SIZE..metadata_size]);
ensure!(
hdr.checksum == calculated_checksum,
"metadata checksum mismatch"
);
let body = TimelineMetadataBody::des(&metadata_bytes[METADATA_HDR_SIZE..metadata_size])?;
ensure!(
body.disk_consistent_lsn.is_aligned(),
"disk_consistent_lsn is not aligned"
);
let data = TimelineMetadata::from(serialize::DeTimelineMetadata::des_prefix(data)?);
assert!(data.disk_consistent_lsn.is_aligned());
Ok(data)
Ok(TimelineMetadata { hdr, body })
}
pub fn to_bytes(&self) -> anyhow::Result<Vec<u8>> {
let serializeable_metadata = serialize::SeTimelineMetadata::from(self);
let mut metadata_bytes = serialize::SeTimelineMetadata::ser(&serializeable_metadata)?;
assert!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE);
metadata_bytes.resize(METADATA_MAX_SAFE_SIZE, 0u8);
let checksum = crc32c::crc32c(&metadata_bytes[..METADATA_MAX_DATA_SIZE]);
metadata_bytes[METADATA_MAX_DATA_SIZE..].copy_from_slice(&u32::to_le_bytes(checksum));
let body_bytes = self.body.ser()?;
let metadata_size = METADATA_HDR_SIZE + body_bytes.len();
let hdr = TimelineMetadataHeader {
size: metadata_size as u16,
format_version: STORAGE_FORMAT_VERSION,
checksum: crc32c::crc32c(&body_bytes),
};
let hdr_bytes = hdr.ser()?;
let mut metadata_bytes = vec![0u8; METADATA_MAX_SIZE];
metadata_bytes[0..METADATA_HDR_SIZE].copy_from_slice(&hdr_bytes);
metadata_bytes[METADATA_HDR_SIZE..metadata_size].copy_from_slice(&body_bytes);
Ok(metadata_bytes)
}
/// [`Lsn`] that corresponds to the timeline directory contents
/// stored locally in the pageserver workdir.
pub fn disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
self.body.disk_consistent_lsn
}
pub fn prev_record_lsn(&self) -> Option<Lsn> {
self.prev_record_lsn
self.body.prev_record_lsn
}
pub fn ancestor_timeline(&self) -> Option<ZTimelineId> {
self.ancestor_timeline
self.body.ancestor_timeline
}
pub fn ancestor_lsn(&self) -> Lsn {
self.ancestor_lsn
self.body.ancestor_lsn
}
pub fn latest_gc_cutoff_lsn(&self) -> Lsn {
self.latest_gc_cutoff_lsn
self.body.latest_gc_cutoff_lsn
}
pub fn initdb_lsn(&self) -> Lsn {
self.initdb_lsn
}
}
/// This module is for direct conversion of metadata to bytes and back.
/// For any given metadata, besides the conversion a few verification steps have to
/// be done, so all serde derives are hidden from the user, to avoid accidental
/// verification-less metadata creation.
mod serialize {
use serde::{Deserialize, Serialize};
use zenith_utils::{lsn::Lsn, zid::ZTimelineId};
use super::TimelineMetadata;
#[derive(Serialize)]
pub(super) struct SeTimelineMetadata<'a> {
disk_consistent_lsn: &'a Lsn,
prev_record_lsn: &'a Option<Lsn>,
ancestor_timeline: &'a Option<ZTimelineId>,
ancestor_lsn: &'a Lsn,
latest_gc_cutoff_lsn: &'a Lsn,
initdb_lsn: &'a Lsn,
}
impl<'a> From<&'a TimelineMetadata> for SeTimelineMetadata<'a> {
fn from(other: &'a TimelineMetadata) -> Self {
Self {
disk_consistent_lsn: &other.disk_consistent_lsn,
prev_record_lsn: &other.prev_record_lsn,
ancestor_timeline: &other.ancestor_timeline,
ancestor_lsn: &other.ancestor_lsn,
latest_gc_cutoff_lsn: &other.latest_gc_cutoff_lsn,
initdb_lsn: &other.initdb_lsn,
}
}
}
#[derive(Deserialize)]
pub(super) struct DeTimelineMetadata {
disk_consistent_lsn: Lsn,
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
}
impl From<DeTimelineMetadata> for TimelineMetadata {
fn from(other: DeTimelineMetadata) -> Self {
Self {
disk_consistent_lsn: other.disk_consistent_lsn,
prev_record_lsn: other.prev_record_lsn,
ancestor_timeline: other.ancestor_timeline,
ancestor_lsn: other.ancestor_lsn,
latest_gc_cutoff_lsn: other.latest_gc_cutoff_lsn,
initdb_lsn: other.initdb_lsn,
}
}
self.body.initdb_lsn
}
}
@@ -204,14 +181,14 @@ mod tests {
#[test]
fn metadata_serializes_correctly() {
let original_metadata = TimelineMetadata {
disk_consistent_lsn: Lsn(0x200),
prev_record_lsn: Some(Lsn(0x100)),
ancestor_timeline: Some(TIMELINE_ID),
ancestor_lsn: Lsn(0),
latest_gc_cutoff_lsn: Lsn(0),
initdb_lsn: Lsn(0),
};
let original_metadata = TimelineMetadata::new(
Lsn(0x200),
Some(Lsn(0x100)),
Some(TIMELINE_ID),
Lsn(0),
Lsn(0),
Lsn(0),
);
let metadata_bytes = original_metadata
.to_bytes()
@@ -221,7 +198,7 @@ mod tests {
.expect("Should deserialize its own bytes");
assert_eq!(
deserialized_metadata, original_metadata,
deserialized_metadata.body, original_metadata.body,
"Metadata that was serialized to bytes and deserialized back should not change"
);
}
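
The new metadata format in the diff above is a fixed 512-byte file: an 8-byte header (checksum, size, format version) followed by the serialized body and zero padding. The sketch below shows that layout using only the standard library. Several things are assumptions rather than facts from the diff: the big-endian byte order (guessed from the BeSer name), the FORMAT_VERSION value, the bitwise crc32c helper standing in for the crc32c crate, and the treatment of the body as opaque bytes. The lower-bound check on the stored size is an extra guard that the diff does not have.

```rust
const MAX_SIZE: usize = 512; // mirrors METADATA_MAX_SIZE
const HDR_SIZE: usize = 8; // u32 checksum + u16 size + u16 format_version
const FORMAT_VERSION: u16 = 1; // hypothetical value, for illustration only

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

/// Lay out header and body in a zero-padded buffer of exactly MAX_SIZE bytes.
fn encode(body: &[u8]) -> Result<Vec<u8>, String> {
    let total = HDR_SIZE + body.len();
    if total > MAX_SIZE {
        return Err("metadata too large".to_string());
    }
    let mut buf = vec![0u8; MAX_SIZE];
    buf[0..4].copy_from_slice(&crc32c(body).to_be_bytes());
    buf[4..6].copy_from_slice(&(total as u16).to_be_bytes());
    buf[6..8].copy_from_slice(&FORMAT_VERSION.to_be_bytes());
    buf[HDR_SIZE..total].copy_from_slice(body);
    Ok(buf)
}

/// Validate the header and return the body slice.
fn decode(buf: &[u8]) -> Result<&[u8], String> {
    if buf.len() != MAX_SIZE {
        return Err("metadata bytes size is wrong".to_string());
    }
    let checksum = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let total = u16::from_be_bytes([buf[4], buf[5]]) as usize;
    let version = u16::from_be_bytes([buf[6], buf[7]]);
    if version != FORMAT_VERSION {
        return Err("format version mismatch".to_string());
    }
    // The lower bound keeps the slice below from panicking on a corrupt size.
    if total < HDR_SIZE || total > MAX_SIZE {
        return Err("corrupted metadata file".to_string());
    }
    let body = &buf[HDR_SIZE..total];
    if crc32c(body) != checksum {
        return Err("metadata checksum mismatch".to_string());
    }
    Ok(body)
}
```

Storing the size in the header means the checksum covers only the bytes that carry data, so the padding can change without invalidating the file.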


@@ -2,139 +2,101 @@
//! Common traits and structs for layers
//!
use crate::relish::RelishTag;
use crate::repository::{BlockNumber, ZenithWalRecord};
use crate::repository::{Key, Value};
use crate::walrecord::ZenithWalRecord;
use crate::{ZTenantId, ZTimelineId};
use anyhow::Result;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use std::fmt;
use std::ops::Range;
use std::path::PathBuf;
use zenith_utils::lsn::Lsn;
// Size of one segment in pages (10 MB)
pub const RELISH_SEG_SIZE: u32 = 10 * 1024 * 1024 / 8192;
///
/// Each relish stored in the repository is divided into fixed-sized "segments",
/// with 10 MB of key-space, or 1280 8k pages each.
///
#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy, Serialize, Deserialize)]
pub struct SegmentTag {
pub rel: RelishTag,
pub segno: u32,
}
/// SegmentBlk represents a block number within a segment, or the size of segment.
///
/// This is separate from BlockNumber, which is used for block number within the
/// whole relish. Since this is just a type alias, the compiler will let you mix
/// them freely, but we use the type alias as documentation to make it clear
/// which one we're dealing with.
///
/// (We could turn this into "struct SegmentBlk(u32)" to forbid accidentally
/// assigning a BlockNumber to SegmentBlk or vice versa, but that makes
/// operations more verbose).
pub type SegmentBlk = u32;
impl fmt::Display for SegmentTag {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}.{}", self.rel, self.segno)
pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool
where
T: PartialOrd<T>,
{
if a.start < b.start {
a.end > b.start
} else {
b.end > a.start
}
}
impl SegmentTag {
/// Given a relish and block number, calculate the corresponding segment and
/// block number within the segment.
pub const fn from_blknum(rel: RelishTag, blknum: BlockNumber) -> (SegmentTag, SegmentBlk) {
(
SegmentTag {
rel,
segno: blknum / RELISH_SEG_SIZE,
},
blknum % RELISH_SEG_SIZE,
)
}
pub fn range_eq<T>(a: &Range<T>, b: &Range<T>) -> bool
where
T: PartialEq<T>,
{
a.start == b.start && a.end == b.end
}
/// Struct used to communicate across calls to 'get_value_reconstruct_data'.
///
/// Represents a version of a page at a specific LSN. The LSN is the key of the
/// entry in the 'page_versions' hash, it is not duplicated here.
/// Before first call, you can fill in 'page_img' if you have an older cached
/// version of the page available. That can save work in
/// 'get_value_reconstruct_data', as it can stop searching for page versions
/// when all the WAL records going back to the cached image have been collected.
///
/// A page version can be stored as a full page image, or as WAL record that needs
/// to be applied over the previous page version to reconstruct this version.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum PageVersion {
Page(Bytes),
Wal(ZenithWalRecord),
}
///
/// Struct used to communicate across calls to 'get_page_reconstruct_data'.
///
/// Before first call to get_page_reconstruct_data, you can fill in 'page_img'
/// if you have an older cached version of the page available. That can save
/// work in 'get_page_reconstruct_data', as it can stop searching for page
/// versions when all the WAL records going back to the cached image have been
/// collected.
///
/// When get_page_reconstruct_data returns Complete, 'page_img' is set to an
/// image of the page, or the oldest WAL record in 'records' is a will_init-type
/// When get_value_reconstruct_data returns Complete, 'img' is set to an image
/// of the page, or the oldest WAL record in 'records' is a will_init-type
/// record that initializes the page without requiring a previous image.
///
/// If 'get_page_reconstruct_data' returns Continue, some 'records' may have
/// been collected, but there are more records outside the current layer. Pass
/// the same PageReconstructData struct in the next 'get_page_reconstruct_data'
/// the same ValueReconstructState struct in the next 'get_value_reconstruct_data'
/// call, to collect more records.
///
pub struct PageReconstructData {
#[derive(Debug)]
pub struct ValueReconstructState {
pub records: Vec<(Lsn, ZenithWalRecord)>,
pub page_img: Option<(Lsn, Bytes)>,
pub img: Option<(Lsn, Bytes)>,
}
/// Return value from Layer::get_page_reconstruct_data
pub enum PageReconstructResult {
#[derive(Clone, Copy, Debug)]
pub enum ValueReconstructResult {
/// Got all the data needed to reconstruct the requested page
Complete,
/// This layer didn't contain all the required data, the caller should look up
/// the predecessor layer at the returned LSN and collect more data from there.
Continue(Lsn),
Continue,
/// This layer didn't contain data needed to reconstruct the page version at
/// the returned LSN. This is usually considered an error, but might be OK
/// in some circumstances.
Missing(Lsn),
Missing,
}
/// A Layer contains all data in a "rectangle" consisting of a range of keys and
/// range of LSNs.
///
/// A Layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs.
/// There are two kinds of layers, in-memory and on-disk layers. In-memory
/// layers are used to ingest incoming WAL, and provide fast access
/// to the recent page versions. On-disk layers are stored as files on disk, and
/// are immutable. This trait presents the common functionality of
/// in-memory and on-disk layers.
/// layers are used to ingest incoming WAL, and provide fast access to the
/// recent page versions. On-disk layers are stored as files on disk, and are
/// immutable. This trait presents the common functionality of in-memory and
/// on-disk layers.
///
/// Furthermore, there are two kinds of on-disk layers: delta and image layers.
/// A delta layer contains all modifications within a range of LSNs and keys.
/// An image layer is a snapshot of all the data in a key-range, at a single
/// LSN.
///
pub trait Layer: Send + Sync {
fn get_tenant_id(&self) -> ZTenantId;
/// Identify the timeline this relish belongs to
/// Identify the timeline this layer belongs to
fn get_timeline_id(&self) -> ZTimelineId;
/// Identify the relish segment
fn get_seg_tag(&self) -> SegmentTag;
/// Range of keys that this layer covers
fn get_key_range(&self) -> Range<Key>;
/// Inclusive start bound of the LSN range that this layer holds
fn get_start_lsn(&self) -> Lsn;
/// Exclusive end bound of the LSN range that this layer holds.
///
/// - For an open in-memory layer, this is MAX_LSN.
/// - For a frozen in-memory layer or a delta layer, this is a valid end bound.
/// - An image layer represents a snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
fn get_end_lsn(&self) -> Lsn;
/// Is the segment represented by this layer dropped by PostgreSQL?
fn is_dropped(&self) -> bool;
fn get_lsn_range(&self) -> Range<Lsn>;
/// Filename used to store this layer on disk. (Even in-memory layers
/// implement this, to print a handy unique identifier for the layer for
@@ -153,20 +115,14 @@ pub trait Layer: Send + Sync {
/// is available. If this returns PageReconstructResult::Continue, look up
/// the predecessor layer and call again with the same 'reconstruct_data' to
/// collect more data.
fn get_page_reconstruct_data(
fn get_value_reconstruct_data(
&self,
blknum: SegmentBlk,
lsn: Lsn,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult>;
key: Key,
lsn_range: Range<Lsn>,
reconstruct_data: &mut ValueReconstructState,
) -> Result<ValueReconstructResult>;
/// Return size of the segment at given LSN. (Only for blocky relations.)
fn get_seg_size(&self, lsn: Lsn) -> Result<SegmentBlk>;
/// Does the segment exist at given LSN? Or was it dropped before it.
fn get_seg_exists(&self, lsn: Lsn) -> Result<bool>;
/// Does this layer only contain some data for the segment (incremental),
/// Does this layer only contain some data for the key-range (incremental),
/// or does it contain a version of every page? This is important to know
/// for garbage collecting old layers: an incremental layer depends on
/// the previous non-incremental layer.
@@ -175,13 +131,12 @@ pub trait Layer: Send + Sync {
/// Returns true for layers that are represented in memory.
fn is_in_memory(&self) -> bool;
/// Release memory used by this layer. There is no corresponding 'load'
/// function, that's done implicitly when you call one of the get-functions.
fn unload(&self) -> Result<()>;
/// Iterate through all keys and values stored in the layer
fn iter(&self) -> Box<dyn Iterator<Item = Result<(Key, Lsn, Value)>> + '_>;
/// Permanently remove this layer from disk.
fn delete(&self) -> Result<()>;
/// Dump summary of the contents of the layer to stdout
fn dump(&self) -> Result<()>;
fn dump(&self, verbose: bool) -> Result<()>;
}
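
Note on the Layer trait above: the doc comments describe a caller that passes the same ValueReconstructState to successive layers until one of them returns Complete. The following is a minimal sketch of that loop, not the pageserver's actual implementation. `ToyLayer`, `collect`, the `String` stand-in for WAL records, and the integer `Lsn`/`Key` aliases are all hypothetical; the sketch also assumes layers are visited newest first and does not model will_init records.

```rust
use std::ops::Range;

// Hypothetical stand-ins: the real Lsn, Key and WAL record types are richer.
type Lsn = u64;
type Key = u32;

#[derive(Debug, Default)]
struct ValueReconstructState {
    records: Vec<(Lsn, String)>, // collected newest first
    img: Option<(Lsn, Vec<u8>)>, // base image, once one has been found
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum ValueReconstructResult {
    Complete,
    Continue,
    Missing,
}

// Toy layer: an image at a single LSN, or deltas starting at `start`.
// Delta records are assumed to be sorted by ascending LSN.
enum ToyLayer {
    Image { lsn: Lsn, page: Vec<u8> },
    Delta { start: Lsn, records: Vec<(Lsn, String)> },
}

impl ToyLayer {
    fn start_lsn(&self) -> Lsn {
        match self {
            ToyLayer::Image { lsn, .. } => *lsn,
            ToyLayer::Delta { start, .. } => *start,
        }
    }

    fn get_value_reconstruct_data(
        &self,
        _key: Key, // ignored: every toy layer holds a single key
        lsn_range: Range<Lsn>,
        state: &mut ValueReconstructState,
    ) -> ValueReconstructResult {
        match self {
            // An image inside the requested range ends the search.
            ToyLayer::Image { lsn, page } if lsn_range.contains(lsn) => {
                state.img = Some((*lsn, page.clone()));
                ValueReconstructResult::Complete
            }
            ToyLayer::Image { .. } => ValueReconstructResult::Missing,
            // A delta layer contributes its records; older data is still needed.
            ToyLayer::Delta { records, .. } => {
                for (lsn, rec) in records.iter().rev() {
                    if lsn_range.contains(lsn) {
                        state.records.push((*lsn, rec.clone()));
                    }
                }
                ValueReconstructResult::Continue
            }
        }
    }
}

/// Visit layers newest first, reusing one state, until a layer reports
/// Complete or Missing. Running out of layers is reported as Missing.
fn collect(
    layers: &[ToyLayer],
    key: Key,
    end_lsn: Lsn,
) -> (ValueReconstructResult, ValueReconstructState) {
    let mut state = ValueReconstructState::default();
    let mut upper = end_lsn; // exclusive upper bound for the next layer
    for layer in layers {
        let start = layer.start_lsn();
        match layer.get_value_reconstruct_data(key, start..upper, &mut state) {
            ValueReconstructResult::Continue => upper = start,
            done => return (done, state),
        }
    }
    (ValueReconstructResult::Missing, state)
}
```

Calling `collect` with a delta layer followed by an older image layer yields Complete, with the delta records in newest-first order and the image as the base for WAL redo.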


@@ -2,10 +2,12 @@ pub mod basebackup;
pub mod config;
pub mod http;
pub mod import_datadir;
pub mod keyspace;
pub mod layered_repository;
pub mod page_cache;
pub mod page_service;
pub mod relish;
pub mod pgdatadir_mapping;
pub mod reltag;
pub mod remote_storage;
pub mod repository;
pub mod tenant_mgr;
@@ -19,8 +21,28 @@ pub mod walrecord;
pub mod walredo;
use lazy_static::lazy_static;
use tracing::info;
use zenith_metrics::{register_int_gauge_vec, IntGaugeVec};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use zenith_utils::{
postgres_backend,
zid::{ZTenantId, ZTimelineId},
};
use crate::thread_mgr::ThreadKind;
use layered_repository::LayeredRepository;
use pgdatadir_mapping::DatadirTimeline;
/// Current storage format version
///
/// This is embedded in the metadata file, and also in the header of all the
/// layer files. If you make any backwards-incompatible changes to the storage
/// format, bump this!
pub const STORAGE_FORMAT_VERSION: u16 = 3;
// Magic constants used to identify different kinds of files
pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;
pub const DELTA_FILE_MAGIC: u16 = 0x5A61;
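
Note on the constants above: the comment says the format version is embedded in the header of every layer file, next to a magic number that identifies the file kind. The sketch below shows how a reader could validate such a header. The byte layout (two big-endian u16 values, magic first) and the names `LayerKind` and `check_header` are assumptions made for illustration; the diff does not show the real on-disk layout.

```rust
const STORAGE_FORMAT_VERSION: u16 = 3;
const IMAGE_FILE_MAGIC: u16 = 0x5A60;
const DELTA_FILE_MAGIC: u16 = 0x5A61;

#[derive(Debug, PartialEq)]
enum LayerKind {
    Image,
    Delta,
}

/// Validate a hypothetical 4-byte header: magic (u16, big-endian) followed
/// by the storage format version (u16, big-endian).
fn check_header(buf: &[u8]) -> Result<LayerKind, String> {
    if buf.len() < 4 {
        return Err("header too short".to_string());
    }
    let magic = u16::from_be_bytes([buf[0], buf[1]]);
    let version = u16::from_be_bytes([buf[2], buf[3]]);

    let kind = match magic {
        IMAGE_FILE_MAGIC => LayerKind::Image,
        DELTA_FILE_MAGIC => LayerKind::Delta,
        other => return Err(format!("unknown magic {:#06x}", other)),
    };
    // Reject files written with an incompatible storage format.
    if version != STORAGE_FORMAT_VERSION {
        return Err(format!(
            "unsupported format version {} (expected {})",
            version, STORAGE_FORMAT_VERSION
        ));
    }
    Ok(kind)
}
```

Checking the magic before the version means a file of the wrong kind is reported as such, rather than as a version mismatch.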
lazy_static! {
static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!(
@@ -36,10 +58,42 @@ pub const LOG_FILE_NAME: &str = "pageserver.log";
/// Config for the Repository checkpointer
#[derive(Debug, Clone, Copy)]
pub enum CheckpointConfig {
// Flush in-memory data that is older than this
Distance(u64),
// Flush all in-memory data
Flush,
// Flush all in-memory data and reconstruct all page images
Forced,
}
pub type RepositoryImpl = LayeredRepository;
pub type DatadirTimelineImpl = DatadirTimeline<RepositoryImpl>;
pub fn shutdown_pageserver() {
// Shut down the libpq endpoint thread. This prevents new connections from
// being accepted.
thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None);
// Shut down any page service threads.
postgres_backend::set_pgbackend_shutdown_requested();
thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None);
// Shut down all the tenants. This flushes everything to disk and kills
// the checkpoint and GC threads.
tenant_mgr::shutdown_all_tenants();
// Stop syncing with remote storage.
//
// FIXME: Does this wait for the sync thread to finish syncing what's queued up?
// Should it?
thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None);
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None);
// There should be nothing left, but let's be sure
thread_mgr::shutdown_threads(None, None, None);
info!("Shut down successfully completed");
std::process::exit(0);
}


@@ -41,7 +41,7 @@ use std::{
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard,
RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
},
};
@@ -53,19 +53,16 @@ use zenith_utils::{
};
use crate::layered_repository::writeback_ephemeral_file;
use crate::{config::PageServerConf, relish::RelTag};
use crate::repository::Key;
static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
const TEST_PAGE_CACHE_SIZE: usize = 10;
const TEST_PAGE_CACHE_SIZE: usize = 50;
///
/// Initialize the page cache. This must be called once at page server startup.
///
pub fn init(conf: &'static PageServerConf) {
if PAGE_CACHE
.set(PageCache::new(conf.page_cache_size))
.is_err()
{
pub fn init(size: usize) {
if PAGE_CACHE.set(PageCache::new(size)).is_err() {
panic!("page cache already initialized");
}
}
@@ -93,6 +90,7 @@ const MAX_USAGE_COUNT: u8 = 5;
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
#[derive(Debug, PartialEq, Eq, Clone)]
#[allow(clippy::enum_variant_names)]
enum CacheKey {
MaterializedPage {
hash_key: MaterializedPageHashKey,
@@ -102,14 +100,17 @@ enum CacheKey {
file_id: u64,
blkno: u32,
},
ImmutableFilePage {
file_id: u64,
blkno: u32,
},
}
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct MaterializedPageHashKey {
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
key: Key,
}
#[derive(Clone)]
@@ -177,6 +178,8 @@ pub struct PageCache {
ephemeral_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: RwLock<HashMap<(u64, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
@@ -199,6 +202,12 @@ impl std::ops::Deref for PageReadGuard<'_> {
}
}
impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> {
fn as_ref(&self) -> &[u8; PAGE_SZ] {
self.0.buf
}
}
///
/// PageWriteGuard is a lease on a buffer for modifying it. The page is kept locked
/// until the guard is dropped.
@@ -230,6 +239,12 @@ impl std::ops::Deref for PageWriteGuard<'_> {
}
}
impl AsMut<[u8; PAGE_SZ]> for PageWriteGuard<'_> {
fn as_mut(&mut self) -> &mut [u8; PAGE_SZ] {
self.inner.buf
}
}
impl PageWriteGuard<'_> {
/// Mark that the buffer contents are now valid.
pub fn mark_valid(&mut self) {
@@ -294,16 +309,14 @@ impl PageCache {
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
key: &Key,
lsn: Lsn,
) -> Option<(Lsn, PageReadGuard)> {
let mut cache_key = CacheKey::MaterializedPage {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
key: *key,
},
lsn,
};
@@ -326,8 +339,7 @@ impl PageCache {
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
key: Key,
lsn: Lsn,
img: &[u8],
) {
@@ -335,8 +347,7 @@ impl PageCache {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
key,
},
lsn,
};
@@ -389,6 +400,36 @@ impl PageCache {
}
}
// Section 1.3: Public interface functions for working with immutable file pages.
pub fn read_immutable_buf(&self, file_id: u64, blkno: u32) -> ReadBufResult {
let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_immutable(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::ImmutableFilePage { file_id, blkno: _ }
if *file_id == drop_file_id =>
{
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
}
//
// Section 2: Internal interface functions for lookup/update.
//
@@ -586,6 +627,10 @@ impl PageCache {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
@@ -609,6 +654,10 @@ impl PageCache {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
@@ -640,6 +689,11 @@ impl PageCache {
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
}
}
}
@@ -680,6 +734,16 @@ impl PageCache {
}
}
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
Entry::Occupied(entry) => Some(*entry.get()),
Entry::Vacant(entry) => {
entry.insert(slot_idx);
None
}
}
}
}
}
@@ -691,16 +755,33 @@ impl PageCache {
///
/// On return, the slot is empty and write-locked.
fn find_victim(&self) -> (usize, RwLockWriteGuard<SlotInner>) {
let iter_limit = self.slots.len() * 2;
let iter_limit = self.slots.len() * 10;
let mut iters = 0;
loop {
iters += 1;
let slot_idx = self.next_evict_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
let slot = &self.slots[slot_idx];
if slot.dec_usage_count() == 0 || iters >= iter_limit {
let mut inner = slot.inner.write().unwrap();
if slot.dec_usage_count() == 0 {
let mut inner = match slot.inner.try_write() {
Ok(inner) => inner,
Err(TryLockError::Poisoned(err)) => {
panic!("buffer lock was poisoned: {:?}", err)
}
Err(TryLockError::WouldBlock) => {
// If we have looped through the whole buffer pool 10 times
// and still haven't found a victim buffer, something's wrong.
// Maybe all the buffers were locked. That could happen in
// theory, if you have more threads holding buffers locked than
// there are buffers in the pool. In practice, with a reasonably
// large buffer pool it really shouldn't happen.
if iters > iter_limit {
panic!("could not find a victim buffer to evict");
}
continue;
}
};
if let Some(old_key) = &inner.key {
if inner.dirty {
if let Err(err) = Self::writeback(old_key, inner.buf) {
@@ -725,8 +806,6 @@ impl PageCache {
}
return (slot_idx, inner);
}
iters += 1;
}
}
@@ -735,12 +814,20 @@ impl PageCache {
CacheKey::MaterializedPage {
hash_key: _,
lsn: _,
} => {
panic!("unexpected dirty materialized page");
}
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty materialized page",
)),
CacheKey::EphemeralPage { file_id, blkno } => {
writeback_ephemeral_file(*file_id, *blkno, buf)
}
CacheKey::ImmutableFilePage {
file_id: _,
blkno: _,
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty immutable page",
)),
}
}
@@ -771,6 +858,7 @@ impl PageCache {
Self {
materialized_page_map: Default::default(),
ephemeral_page_map: Default::default(),
immutable_page_map: Default::default(),
slots,
next_evict_slot: AtomicUsize::new(0),
}
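
Note on find_victim above: the change replaces a blocking write() with try_write(), so the clock sweep skips slots that another thread holds locked and gives up after ten passes over the pool. The following is a self-contained sketch of that idea under simplifying assumptions. `Pool`, `Slot` and the `Option<u64>` payload are hypothetical stand-ins, there is no dirty-page writeback, and the sketch returns None where the pageserver code panics.

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::{RwLock, RwLockWriteGuard, TryLockError};

struct Slot {
    usage_count: AtomicU8,
    inner: RwLock<Option<u64>>, // cached key; None when the slot is empty
}

impl Slot {
    /// Decrement the usage count, saturating at zero, and return the new value.
    fn dec_usage_count(&self) -> u8 {
        let prev = self
            .usage_count
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| {
                Some(v.saturating_sub(1))
            })
            .unwrap(); // the closure always returns Some, so this cannot fail
        prev.saturating_sub(1)
    }
}

struct Pool {
    slots: Vec<Slot>,
    next_evict_slot: AtomicUsize,
}

impl Pool {
    /// Build a pool with one empty slot per initial usage count.
    fn new(usage_counts: &[u8]) -> Self {
        let slots = usage_counts
            .iter()
            .map(|&c| Slot {
                usage_count: AtomicU8::new(c),
                inner: RwLock::new(None),
            })
            .collect();
        Pool {
            slots,
            next_evict_slot: AtomicUsize::new(0),
        }
    }

    /// Clock sweep: pick the first slot whose usage count has dropped to zero
    /// and whose lock is free. Returns None after 10 passes without success.
    fn find_victim(&self) -> Option<(usize, RwLockWriteGuard<'_, Option<u64>>)> {
        let iter_limit = self.slots.len() * 10;
        let mut iters = 0;
        loop {
            iters += 1;
            let idx = self.next_evict_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
            let slot = &self.slots[idx];
            if slot.dec_usage_count() == 0 {
                match slot.inner.try_write() {
                    Ok(mut inner) => {
                        *inner = None; // evict whatever was cached here
                        return Some((idx, inner));
                    }
                    Err(TryLockError::Poisoned(err)) => {
                        panic!("buffer lock was poisoned: {:?}", err)
                    }
                    // Someone else holds the slot; move on to the next one.
                    Err(TryLockError::WouldBlock) => {}
                }
            }
            if iters > iter_limit {
                return None;
            }
        }
    }
}
```

Skipping locked slots keeps one slow holder from stalling every eviction; the iteration limit bounds the search when every slot is locked at once.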


@@ -32,7 +32,9 @@ use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::basebackup;
use crate::config::PageServerConf;
use crate::relish::*;
use crate::pgdatadir_mapping::DatadirTimeline;
use crate::reltag::RelTag;
use crate::repository::Repository;
use crate::repository::Timeline;
use crate::tenant_mgr;
use crate::thread_mgr;
@@ -228,6 +230,7 @@ pub fn thread_main(
None,
None,
"serving Page Service thread",
false,
move || page_service_conn_main(conf, local_auth, socket, auth_type),
) {
// Thread creation failed. Log the error and continue.
@@ -322,8 +325,8 @@ impl PageServerHandler {
let _enter = info_span!("pagestream", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Cannot handle pagerequests for a remote timeline")?;
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
/* switch client to COPYBOTH */
pgb.write_message(&BeMessage::CopyBothResponse)?;
@@ -397,8 +400,8 @@ impl PageServerHandler {
/// In either case, if the page server hasn't received the WAL up to the
/// requested LSN yet, we will wait for it to arrive. The return value is
/// the LSN that should be used to look up the page versions.
fn wait_or_get_last_lsn(
timeline: &dyn Timeline,
fn wait_or_get_last_lsn<R: Repository>(
timeline: &DatadirTimeline<R>,
mut lsn: Lsn,
latest: bool,
latest_gc_cutoff_lsn: &RwLockReadGuard<Lsn>,
@@ -425,7 +428,7 @@ impl PageServerHandler {
if lsn <= last_record_lsn {
lsn = last_record_lsn;
} else {
timeline.wait_lsn(lsn)?;
timeline.tline.wait_lsn(lsn)?;
// Since we waited for 'lsn' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the
// last-record LSN can advance immediately after we return
@@ -435,7 +438,7 @@ impl PageServerHandler {
if lsn == Lsn(0) {
bail!("invalid LSN(0) in request");
}
timeline.wait_lsn(lsn)?;
timeline.tline.wait_lsn(lsn)?;
}
ensure!(
lsn >= **latest_gc_cutoff_lsn,
@@ -445,54 +448,47 @@ impl PageServerHandler {
Ok(lsn)
}
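
Note on wait_or_get_last_lsn above: the visible hunks suggest the following decision logic, sketched here as a pure function so it can be read in isolation. This is an approximation under assumptions: `resolve_lsn` is a hypothetical name, `Lsn` is a plain integer, the branch on `latest` is inferred from the doc comment, and the boolean result stands in for the blocking wait_lsn() call.

```rust
// Hypothetical stand-in for the pageserver's Lsn type.
type Lsn = u64;

/// Decide which LSN to read at, and whether the caller would have to wait
/// for WAL to arrive first.
fn resolve_lsn(
    req_lsn: Lsn,
    latest: bool,
    last_record_lsn: Lsn,
    gc_cutoff_lsn: Lsn,
) -> Result<(Lsn, bool), String> {
    let mut lsn = req_lsn;
    let must_wait;

    if latest {
        if lsn <= last_record_lsn {
            // The requested WAL has already been received: read the newest version.
            lsn = last_record_lsn;
            must_wait = false;
        } else {
            // The compute node is ahead of us: wait for its WAL to arrive.
            must_wait = true;
        }
    } else {
        if lsn == 0 {
            return Err("invalid LSN(0) in request".to_string());
        }
        // The original always calls wait_lsn() here; this sketch assumes the
        // call returns at once when the LSN has already been received.
        must_wait = lsn > last_record_lsn;
    }

    // Page versions older than the GC cutoff may already be gone.
    if lsn < gc_cutoff_lsn {
        return Err(format!(
            "LSN {} is below the GC cutoff {}",
            lsn, gc_cutoff_lsn
        ));
    }
    Ok((lsn, must_wait))
}
```

With last_record_lsn = 10, a "latest" request at LSN 5 is served at LSN 10 without waiting, while a "latest" request at LSN 15 must wait for the WAL to catch up.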
fn handle_get_rel_exists_request(
fn handle_get_rel_exists_request<R: Repository>(
&self,
timeline: &dyn Timeline,
timeline: &DatadirTimeline<R>,
req: &PagestreamExistsRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_rel_exists", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
let exists = timeline.get_rel_exists(tag, lsn)?;
let exists = timeline.get_rel_exists(req.rel, lsn)?;
Ok(PagestreamBeMessage::Exists(PagestreamExistsResponse {
exists,
}))
}
fn handle_get_nblocks_request(
fn handle_get_nblocks_request<R: Repository>(
&self,
timeline: &dyn Timeline,
timeline: &DatadirTimeline<R>,
req: &PagestreamNblocksRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_nblocks", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
let n_blocks = timeline.get_relish_size(tag, lsn)?;
// Return 0 if relation is not found.
// This is what postgres smgr expects.
let n_blocks = n_blocks.unwrap_or(0);
let n_blocks = timeline.get_rel_size(req.rel, lsn)?;
Ok(PagestreamBeMessage::Nblocks(PagestreamNblocksResponse {
n_blocks,
}))
}
fn handle_get_page_at_lsn_request(
fn handle_get_page_at_lsn_request<R: Repository>(
&self,
timeline: &dyn Timeline,
timeline: &DatadirTimeline<R>,
req: &PagestreamGetPageRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_page", rel = %req.rel, blkno = &req.blkno, req_lsn = %req.lsn)
.entered();
let tag = RelishTag::Relation(req.rel);
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?;
/*
// Add a 1s delay to some requests. The delay causes the requests to
@@ -502,7 +498,7 @@ impl PageServerHandler {
std::thread::sleep(std::time::Duration::from_millis(1000));
}
*/
let page = timeline.get_page_at_lsn(tag, req.blkno, lsn)?;
let page = timeline.get_rel_page_at_lsn(req.rel, req.blkno, lsn)?;
Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
page,
@@ -518,11 +514,12 @@ impl PageServerHandler {
) -> anyhow::Result<()> {
let span = info_span!("basebackup", timeline = %timelineid, tenant = %tenantid, lsn = field::Empty);
let _enter = span.enter();
info!("starting");
// check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Cannot handle basebackup request for a remote timeline")?;
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn();
if let Some(lsn) = lsn {
timeline
.check_lsn_is_in_scope(lsn, &latest_gc_cutoff_lsn)
@@ -540,7 +537,7 @@ impl PageServerHandler {
basebackup.send_tarball()?;
}
pgb.write_message(&BeMessage::CopyDone)?;
debug!("CopyDone sent!");
info!("done");
Ok(())
}
@@ -574,7 +571,6 @@ impl postgres_backend::Handler for PageServerHandler {
let data = self
.auth
.as_ref()
.as_ref()
.unwrap()
.decode(str::from_utf8(jwt_response)?)?;
@@ -655,8 +651,8 @@ impl postgres_backend::Handler for PageServerHandler {
info_span!("callmemaybe", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Failed to fetch local timeline for callmemaybe requests")?;
tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
walreceiver::launch_wal_receiver(self.conf, tenantid, timelineid, &connstr)?;
@@ -701,70 +697,42 @@ impl postgres_backend::Handler for PageServerHandler {
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"layer_relfiles_total"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_relfiles_not_updated"),
RowDescriptor::int8_col(b"layer_relfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_relfiles_removed"),
RowDescriptor::int8_col(b"layer_relfiles_dropped"),
RowDescriptor::int8_col(b"layer_nonrelfiles_total"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_nonrelfiles_removed"),
RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"),
RowDescriptor::int8_col(b"layers_total"),
RowDescriptor::int8_col(b"layers_needed_by_cutoff"),
RowDescriptor::int8_col(b"layers_needed_by_branches"),
RowDescriptor::int8_col(b"layers_not_updated"),
RowDescriptor::int8_col(b"layers_removed"),
RowDescriptor::int8_col(b"elapsed"),
]))?
.write_message_noflush(&BeMessage::DataRow(&[
Some(result.ondisk_relfiles_total.to_string().as_bytes()),
Some(
result
.ondisk_relfiles_needed_by_cutoff
.to_string()
.as_bytes(),
),
Some(
result
.ondisk_relfiles_needed_by_branches
.to_string()
.as_bytes(),
),
Some(result.ondisk_relfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_relfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_relfiles_removed.to_string().as_bytes()),
Some(result.ondisk_relfiles_dropped.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_total.to_string().as_bytes()),
Some(
result
.ondisk_nonrelfiles_needed_by_cutoff
.to_string()
.as_bytes(),
),
Some(
result
.ondisk_nonrelfiles_needed_by_branches
.to_string()
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_nonrelfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_removed.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_dropped.to_string().as_bytes()),
Some(result.layers_total.to_string().as_bytes()),
Some(result.layers_needed_by_cutoff.to_string().as_bytes()),
Some(result.layers_needed_by_branches.to_string().as_bytes()),
Some(result.layers_not_updated.to_string().as_bytes()),
Some(result.layers_removed.to_string().as_bytes()),
Some(result.elapsed.as_millis().to_string().as_bytes()),
]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("compact ") {
// Run compaction immediately on given timeline.
// FIXME This is just for tests. Don't expect this to be exposed to
// the users or the api.
// compact <tenant_id> <timeline_id>
let re = Regex::new(r"^compact ([[:xdigit:]]+)\s([[:xdigit:]]+)($|\s)?").unwrap();
let caps = re
.captures(query_string)
.with_context(|| format!("Invalid compact: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
.context("Couldn't load timeline")?;
timeline.tline.compact()?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("checkpoint ") {
// Run checkpoint immediately on given timeline.
@@ -778,10 +746,17 @@ impl postgres_backend::Handler for PageServerHandler {
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)
.context("Failed to fetch local timeline for checkpoint request")?;
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
timeline.tline.checkpoint(CheckpointConfig::Forced)?;
// Also compact it.
//
// FIXME: This probably shouldn't be part of a "checkpoint" command, but a
// separate operation. Update the tests if you change this.
timeline.tline.compact()?;
timeline.checkpoint(CheckpointConfig::Forced)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else {

File diff suppressed because it is too large


@@ -1,226 +0,0 @@
//!
//! Zenith stores PostgreSQL relations, and some other files, in the
//! repository. The relations (i.e. tables and indexes) take up most
//! of the space in a typical installation, while the other files are
//! small. We call each relation and other file that is stored in the
//! repository a "relish". It comes from "rel"-ish, as in "kind of a
//! rel", because it covers relations as well as other things that are
//! not relations, but are treated similarly for the purposes of the
//! storage layer.
//!
//! This source file contains the definition of the RelishTag struct,
//! which uniquely identifies a relish.
//!
//! Relishes come in two flavors: blocky and non-blocky. Relations and
//! SLRUs are blocky, that is, they are divided into 8k blocks, and
//! the repository tracks their size. Other relishes are non-blocky:
//! the content of the whole relish is stored as one blob. Block
//! number must be passed as 0 for all operations on a non-blocky
//! relish. The one "block" that you store in a non-blocky relish can
//! have arbitrary size, but they are expected to be small, or you
//! will have performance issues.
//!
//! All relishes are versioned by LSN in the repository.
//!
use serde::{Deserialize, Serialize};
use std::fmt;
use postgres_ffi::relfile_utils::forknumber_to_name;
use postgres_ffi::{Oid, TransactionId};
///
/// RelishTag identifies one relish.
///
#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
pub enum RelishTag {
// Relations correspond to PostgreSQL relation forks. Each
// PostgreSQL relation fork is considered a separate relish.
Relation(RelTag),
// SLRUs include pg_clog, pg_multixact/members, and
// pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
// they don't need to be stored permanently (e.g. pg_subtrans),
// or we do not support them in zenith yet (pg_commit_ts).
//
// These are currently never requested directly by the compute
// nodes, although in principle that would be possible. However,
// when a new compute node is created, these are included in the
// tarball that we send to the compute node to initialize the
// PostgreSQL data directory.
//
// Each SLRU segment in PostgreSQL is considered a separate
// relish. For example, pg_clog/0000, pg_clog/0001, and so forth.
//
// SLRU segments are divided into blocks, like relations.
Slru { slru: SlruKind, segno: u32 },
// Miscellaneous other files that need to be included in the
// tarball at compute node creation. These are non-blocky, and are
// expected to be small.
//
// FileNodeMap represents PostgreSQL's 'pg_filenode.map'
// files. They are needed to map catalog table OIDs to filenode
// numbers. Usually the mapping is done by looking up a relation's
// 'relfilenode' field in the 'pg_class' system table, but that
// doesn't work for 'pg_class' itself and a few other such system
// relations. See PostgreSQL relmapper.c for details.
//
// Each database has a map file for its local mapped catalogs,
// and there is a separate map file for shared catalogs.
//
// These files are always 512 bytes long (although we don't check
// or care about that in the page server).
//
FileNodeMap { spcnode: Oid, dbnode: Oid },
//
// State files for prepared transactions (e.g pg_twophase/1234)
//
TwoPhase { xid: TransactionId },
// The control file, stored in global/pg_control
ControlFile,
// Special entry that represents PostgreSQL checkpoint. It doesn't
// correspond to any physical file in PostgreSQL, but we use it
// to track fields needed to restore the checkpoint data in the
// control file, when a compute node is created.
Checkpoint,
}
impl RelishTag {
pub const fn is_blocky(&self) -> bool {
match self {
// These relishes work with blocks
RelishTag::Relation(_) | RelishTag::Slru { slru: _, segno: _ } => true,
// and these don't
RelishTag::FileNodeMap {
spcnode: _,
dbnode: _,
}
| RelishTag::TwoPhase { xid: _ }
| RelishTag::ControlFile
| RelishTag::Checkpoint => false,
}
}
// Physical relishes represent files and use
// RelationSizeEntry to track existing and dropped files.
// They can be both blocky and non-blocky.
pub const fn is_physical(&self) -> bool {
match self {
// These relishes represent physical files
RelishTag::Relation(_)
| RelishTag::Slru { .. }
| RelishTag::FileNodeMap { .. }
| RelishTag::TwoPhase { .. } => true,
// and these don't
RelishTag::ControlFile | RelishTag::Checkpoint => false,
}
}
// convenience function to check if this relish is a normal relation.
pub const fn is_relation(&self) -> bool {
matches!(self, RelishTag::Relation(_))
}
}
///
/// Relation data file segment id throughout the Postgres cluster.
///
/// Every data file in Postgres is uniquely identified by 4 numbers:
/// - relation id / node (`relnode`)
/// - database id (`dbnode`)
/// - tablespace id (`spcnode`), in short this is a unique id of a separate
/// directory to store data files.
/// - forknumber (`forknum`) is used to split different kinds of data of the same relation
/// between some set of files (`relnode`, `relnode_fsm`, `relnode_vm`).
///
/// In native Postgres code `RelFileNode` structure and individual `ForkNumber` value
/// are used for the same purpose.
/// [See more related comments here](https://github.com/postgres/postgres/blob/99c5852e20a0987eca1c38ba0c09329d4076b6a0/src/include/storage/relfilenode.h#L57).
///
#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy, Serialize, Deserialize)]
pub struct RelTag {
pub forknum: u8,
pub spcnode: Oid,
pub dbnode: Oid,
pub relnode: Oid,
}
/// Display RelTag in the same format that's used in most PostgreSQL debug messages:
///
/// <spcnode>/<dbnode>/<relnode>[_fsm|_vm|_init]
///
impl fmt::Display for RelTag {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if let Some(forkname) = forknumber_to_name(self.forknum) {
write!(
f,
"{}/{}/{}_{}",
self.spcnode, self.dbnode, self.relnode, forkname
)
} else {
write!(f, "{}/{}/{}", self.spcnode, self.dbnode, self.relnode)
}
}
}
/// Display RelishTag. Relations use the same format that's used in most PostgreSQL debug messages:
///
/// <spcnode>/<dbnode>/<relnode>[_fsm|_vm|_init]
///
impl fmt::Display for RelishTag {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
RelishTag::Relation(rel) => rel.fmt(f),
RelishTag::Slru { slru, segno } => {
// e.g. pg_clog/0001
write!(f, "{}/{:04X}", slru.to_str(), segno)
}
RelishTag::FileNodeMap { spcnode, dbnode } => {
write!(f, "relmapper file for spc {} db {}", spcnode, dbnode)
}
RelishTag::TwoPhase { xid } => {
write!(f, "pg_twophase/{:08X}", xid)
}
RelishTag::ControlFile => {
write!(f, "control file")
}
RelishTag::Checkpoint => {
write!(f, "checkpoint")
}
}
}
}
///
/// Non-relation transaction status files (clog (a.k.a. pg_xact) and
/// pg_multixact) in Postgres are handled by SLRU (Simple LRU) buffer,
/// hence the name.
///
/// These files are global for a postgres instance.
///
/// These files are divided into segments, which are divided into
/// pages of the same BLCKSZ as used for relation files.
///
#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
pub enum SlruKind {
Clog,
MultiXactMembers,
MultiXactOffsets,
}
impl SlruKind {
pub fn to_str(&self) -> &'static str {
match self {
Self::Clog => "pg_xact",
Self::MultiXactMembers => "pg_multixact/members",
Self::MultiXactOffsets => "pg_multixact/offsets",
}
}
}

pageserver/src/reltag.rs (new file, 103 lines added)

@@ -0,0 +1,103 @@
use serde::{Deserialize, Serialize};
use std::cmp::Ordering;
use std::fmt;
use postgres_ffi::relfile_utils::forknumber_to_name;
use postgres_ffi::Oid;
///
/// Relation data file segment id throughout the Postgres cluster.
///
/// Every data file in Postgres is uniquely identified by 4 numbers:
/// - relation id / node (`relnode`)
/// - database id (`dbnode`)
/// - tablespace id (`spcnode`), in short this is a unique id of a separate
/// directory to store data files.
/// - forknumber (`forknum`) is used to split different kinds of data of the same relation
/// between some set of files (`relnode`, `relnode_fsm`, `relnode_vm`).
///
/// In native Postgres code `RelFileNode` structure and individual `ForkNumber` value
/// are used for the same purpose.
/// [See more related comments here](https://github.com/postgres/postgres/blob/99c5852e20a0987eca1c38ba0c09329d4076b6a0/src/include/storage/relfilenode.h#L57).
///
// FIXME: should move 'forknum' as last field to keep this consistent with Postgres.
// Then we could replace the custom Ord and PartialOrd implementations below with
// deriving them.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy, Serialize, Deserialize)]
pub struct RelTag {
pub forknum: u8,
pub spcnode: Oid,
pub dbnode: Oid,
pub relnode: Oid,
}
impl PartialOrd for RelTag {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for RelTag {
fn cmp(&self, other: &Self) -> Ordering {
let mut cmp = self.spcnode.cmp(&other.spcnode);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.dbnode.cmp(&other.dbnode);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.relnode.cmp(&other.relnode);
if cmp != Ordering::Equal {
return cmp;
}
cmp = self.forknum.cmp(&other.forknum);
cmp
}
}
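The FIXME above suggests that moving `forknum` to the end of the struct would let the ordering be derived. A minimal sketch of that idea follows, assuming `Oid` is a `u32`; the type `ReorderedRelTag` and the helper `manual_cmp` are hypothetical names used only for illustration and are not part of the pageserver code.

```rust
use std::cmp::Ordering;

/// Hypothetical variant of `RelTag` with `forknum` moved to the end.
/// Derived `Ord` compares fields in declaration order:
/// spcnode, dbnode, relnode, forknum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ReorderedRelTag {
    pub spcnode: u32, // assumption: `Oid` is a 32-bit unsigned integer
    pub dbnode: u32,
    pub relnode: u32,
    pub forknum: u8,
}

/// The hand-written comparison from the diff, expressed with `then_with`,
/// kept here only to check that it agrees with the derived ordering.
pub fn manual_cmp(a: &ReorderedRelTag, b: &ReorderedRelTag) -> Ordering {
    a.spcnode
        .cmp(&b.spcnode)
        .then_with(|| a.dbnode.cmp(&b.dbnode))
        .then_with(|| a.relnode.cmp(&b.relnode))
        .then_with(|| a.forknum.cmp(&b.forknum))
}
```

One caveat worth checking before applying such a change: reordering fields also changes the layout that order-sensitive serde formats produce, so any persisted `RelTag` values would need to be considered.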
/// Display RelTag in the same format that's used in most PostgreSQL debug messages:
///
/// <spcnode>/<dbnode>/<relnode>[_fsm|_vm|_init]
///
impl fmt::Display for RelTag {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if let Some(forkname) = forknumber_to_name(self.forknum) {
write!(
f,
"{}/{}/{}_{}",
self.spcnode, self.dbnode, self.relnode, forkname
)
} else {
write!(f, "{}/{}/{}", self.spcnode, self.dbnode, self.relnode)
}
}
}
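To illustrate the `<spcnode>/<dbnode>/<relnode>[_fsm|_vm|_init]` format described above, here is a small self-contained sketch. `DemoRelTag` and `fork_name` are hypothetical stand-ins: the real code uses `RelTag` and `postgres_ffi::relfile_utils::forknumber_to_name`, whose exact mapping is assumed here to follow the usual Postgres fork numbers (0 = main, 1 = fsm, 2 = vm, 3 = init).

```rust
use std::fmt;

/// Hypothetical stand-in for `forknumber_to_name`: the main fork has no suffix.
fn fork_name(forknum: u8) -> Option<&'static str> {
    match forknum {
        1 => Some("fsm"),
        2 => Some("vm"),
        3 => Some("init"),
        _ => None,
    }
}

/// Hypothetical copy of `RelTag`, with `Oid` assumed to be `u32`.
pub struct DemoRelTag {
    pub forknum: u8,
    pub spcnode: u32,
    pub dbnode: u32,
    pub relnode: u32,
}

impl fmt::Display for DemoRelTag {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Always print the three ids, then append the fork suffix if there is one.
        write!(f, "{}/{}/{}", self.spcnode, self.dbnode, self.relnode)?;
        if let Some(name) = fork_name(self.forknum) {
            write!(f, "_{}", name)?;
        }
        Ok(())
    }
}
```

With these assumptions, a main-fork tag formats as `1663/13010/16384` and its visibility-map fork as `1663/13010/16384_vm`.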
///
/// Non-relation transaction status files (clog (a.k.a. pg_xact) and
/// pg_multixact) in Postgres are handled by SLRU (Simple LRU) buffer,
/// hence the name.
///
/// These files are global for a postgres instance.
///
/// These files are divided into segments, which are divided into
/// pages of the same BLCKSZ as used for relation files.
///
#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
pub enum SlruKind {
Clog,
MultiXactMembers,
MultiXactOffsets,
}
impl SlruKind {
pub fn to_str(&self) -> &'static str {
match self {
Self::Clog => "pg_xact",
Self::MultiXactMembers => "pg_multixact/members",
Self::MultiXactOffsets => "pg_multixact/offsets",
}
}
}
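The SLRU directory names returned by `to_str` combine with a segment number formatted as four uppercase hex digits (the `"{}/{:04X}"` pattern used in the `Display` implementation earlier in this diff). A minimal sketch of that combination, using a hypothetical `DemoSlruKind` copy of the enum and a hypothetical helper `slru_segment_path`:

```rust
/// Hypothetical copy of `SlruKind`, kept local so the example is self-contained.
#[derive(Debug, Clone, Copy)]
pub enum DemoSlruKind {
    Clog,
    MultiXactMembers,
    MultiXactOffsets,
}

impl DemoSlruKind {
    pub fn to_str(self) -> &'static str {
        match self {
            Self::Clog => "pg_xact",
            Self::MultiXactMembers => "pg_multixact/members",
            Self::MultiXactOffsets => "pg_multixact/offsets",
        }
    }
}

/// Builds `<slru dir>/<segment number as 4 uppercase hex digits>`.
pub fn slru_segment_path(kind: DemoSlruKind, segno: u32) -> String {
    format!("{}/{:04X}", kind.to_str(), segno)
}
```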


@@ -5,7 +5,7 @@
//! There are a few components the storage machinery consists of:
//! * [`RemoteStorage`] trait: a CRUD-like generic abstraction for adapting external storages, with a few implementations:
//! * [`local_fs`] allows using the local file system as an external storage
//! * [`rust_s3`] uses AWS S3 bucket as an external storage
//! * [`s3_bucket`] uses AWS S3 bucket as an external storage
//!
//! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync.
//! Synchronization internals are split into submodules
@@ -82,7 +82,7 @@
//! The sync queue processing also happens in batches, so the sync tasks can wait in the queue for some time.
mod local_fs;
mod rust_s3;
mod s3_bucket;
mod storage_sync;
use std::{
@@ -93,28 +93,34 @@ use std::{
use anyhow::{bail, Context};
use tokio::io;
use tracing::{error, info};
use tracing::{debug, error, info};
use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
pub use self::storage_sync::index::{RemoteIndex, TimelineIndexEntry};
pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download};
use self::{local_fs::LocalFs, rust_s3::S3};
use self::{local_fs::LocalFs, s3_bucket::S3Bucket};
use crate::layered_repository::ephemeral_file::is_ephemeral_file;
use crate::{
config::{PageServerConf, RemoteStorageKind},
layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME},
repository::TimelineSyncState,
};
pub use storage_sync::compression;
#[derive(Clone, Copy, Debug)]
pub enum LocalTimelineInitStatus {
LocallyComplete,
NeedsSync,
}
type LocalTimelineInitStatuses = HashMap<ZTenantId, HashMap<ZTimelineId, LocalTimelineInitStatus>>;
/// A structure to combine all synchronization data to share with pageserver after a successful sync loop initialization.
/// Successful initialization includes the case when the sync loop is not started: the startup data is still returned,
/// to simplify the receiving code.
pub struct SyncStartupData {
/// A sync state, derived from initial comparison of local timeline files and the remote archives,
/// before any sync tasks are executed.
/// To reuse the local file scan logic, the timeline states are returned even if no sync loop gets started during init:
/// in this case, no remote files exist and all local timelines with correct metadata files are considered ready.
pub initial_timeline_states: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>>,
pub remote_index: RemoteIndex,
pub local_timeline_init_statuses: LocalTimelineInitStatuses,
}
/// Based on the config, initiates the remote storage connection and starts a separate thread
@@ -145,7 +151,7 @@ pub fn start_local_timeline_sync(
storage_sync::spawn_storage_sync_thread(
config,
local_timeline_files,
S3::new(s3_config, &config.workdir)?,
S3Bucket::new(s3_config, &config.workdir)?,
storage_config.max_concurrent_sync,
storage_config.max_sync_errors,
)
@@ -154,23 +160,18 @@ pub fn start_local_timeline_sync(
.context("Failed to spawn the storage sync thread"),
None => {
info!("No remote storage configured, skipping storage sync, considering all local timelines with correct metadata files enabled");
let mut initial_timeline_states: HashMap<
ZTenantId,
HashMap<ZTimelineId, TimelineSyncState>,
> = HashMap::new();
for (ZTenantTimelineId{tenant_id, timeline_id}, (timeline_metadata, _)) in
let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new();
for (ZTenantTimelineId { tenant_id, timeline_id }, _) in
local_timeline_files
{
initial_timeline_states
local_timeline_init_statuses
.entry(tenant_id)
.or_default()
.insert(
timeline_id,
TimelineSyncState::Ready(timeline_metadata.disk_consistent_lsn()),
);
.insert(timeline_id, LocalTimelineInitStatus::LocallyComplete);
}
Ok(SyncStartupData {
initial_timeline_states,
local_timeline_init_statuses,
remote_index: RemoteIndex::empty(),
})
}
}
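When no remote storage is configured, the branch above marks every local timeline as `LocallyComplete` in a tenant-to-timeline nested map. The sketch below shows the same `entry(..).or_default().insert(..)` pattern in isolation; plain `u64` ids stand in for `ZTenantId` and `ZTimelineId`, and `InitStatus` and `all_locally_complete` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `LocalTimelineInitStatus`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InitStatus {
    LocallyComplete,
    NeedsSync,
}

/// Groups (tenant, timeline) pairs by tenant and marks each timeline as locally complete.
/// `u64` ids are an assumption; the real code uses `ZTenantId` and `ZTimelineId`.
pub fn all_locally_complete(
    ids: impl IntoIterator<Item = (u64, u64)>,
) -> HashMap<u64, HashMap<u64, InitStatus>> {
    let mut statuses: HashMap<u64, HashMap<u64, InitStatus>> = HashMap::new();
    for (tenant_id, timeline_id) in ids {
        statuses
            .entry(tenant_id)
            .or_default() // creates the inner map on first use of a tenant
            .insert(timeline_id, InitStatus::LocallyComplete);
    }
    statuses
}
```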
@@ -260,6 +261,8 @@ fn collect_timelines_for_tenant(
Ok(timelines)
}
// discover timeline files and extract timeline metadata
// NOTE: ephemeral files are excluded from the list
fn collect_timeline_files(
timeline_dir: &Path,
) -> anyhow::Result<(ZTimelineId, TimelineMetadata, Vec<PathBuf>)> {
@@ -279,6 +282,9 @@ fn collect_timeline_files(
if entry_path.is_file() {
if entry_path.file_name().and_then(ffi::OsStr::to_str) == Some(METADATA_FILE_NAME) {
timeline_metadata_path = Some(entry_path);
} else if is_ephemeral_file(&entry_path.file_name().unwrap().to_string_lossy()) {
debug!("skipping ephemeral file {}", entry_path.display());
continue;
} else {
timeline_files.push(entry_path);
}
@@ -319,27 +325,35 @@ trait RemoteStorage: Send + Sync {
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()>;
/// Streams the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Returns the metadata, if any was stored with the file previously.
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
) -> anyhow::Result<Option<StorageMetadata>>;
/// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Returns the metadata, if any was stored with the file previously.
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
) -> anyhow::Result<Option<StorageMetadata>>;
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()>;
}
/// Extra set of key-value pairs that contain arbitrary metadata about the storage entry.
/// Immutable, cannot be changed once the file is created.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StorageMetadata(HashMap<String, String>);
fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> {
if prefix == path {
anyhow::bail!(


@@ -17,7 +17,7 @@ This way, the backups are managed in background, not affecting directly other pa
Current implementation
* provides remote storage wrappers for AWS S3 and local FS
* synchronizes the differences with local timelines and remote states as fast as possible
* uploads new relishes, frozen by pageserver checkpoint thread
* uploads new layer files
* downloads and registers timelines, found on the remote storage, but missing locally, if those are requested somehow via pageserver (e.g. http api, gc)
* uses compression when dealing with files, for better S3 usage
* maintains an index of what's stored remotely
@@ -46,18 +46,6 @@ This could be avoided by a background thread/future storing the serialized index
No file checksum assertion is done currently, but should be (AWS S3 returns file checksums during the `list` operation)
* sad rust-s3 api
rust-s3 is not very pleasant to use:
1. it returns `anyhow::Result` and it's hard to distinguish "missing file" cases from "no connection" one, for instance
2. at least one function in its API that we need (`get_object_stream`) has `async` keyword and blocks (!), see details [here](https://github.com/zenithdb/zenith/pull/752#discussion_r728373091)
3. it's a prerelease library with unclear maintenance status
4. noisy on debug level
But it's already used in the project, so for now it's reused to avoid bloating the dependency tree.
Based on previous evaluation, even `rusoto-s3` could be a better choice over this library, but needs further benchmarking.
* gc is ignored
So far, we don't adjust the remote storage based on GC thread loop results, only checkpointer loop affects the remote storage.


@@ -17,7 +17,7 @@ use tokio::{
};
use tracing::*;
use super::{strip_path_prefix, RemoteStorage};
use super::{strip_path_prefix, RemoteStorage, StorageMetadata};
pub struct LocalFs {
pageserver_workdir: &'static Path,
@@ -53,6 +53,32 @@ impl LocalFs {
)
}
}
async fn read_storage_metadata(
&self,
file_path: &Path,
) -> anyhow::Result<Option<StorageMetadata>> {
let metadata_path = storage_metadata_path(file_path);
if metadata_path.exists() && metadata_path.is_file() {
let metadata_string = fs::read_to_string(&metadata_path).await.with_context(|| {
format!(
"Failed to read metadata from the local storage at '{}'",
metadata_path.display()
)
})?;
serde_json::from_str(&metadata_string)
.with_context(|| {
format!(
"Failed to deserialize metadata from the local storage at '{}'",
metadata_path.display()
)
})
.map(|metadata| Some(StorageMetadata(metadata)))
} else {
Ok(None)
}
}
}
#[async_trait::async_trait]
@@ -80,14 +106,19 @@ impl RemoteStorage for LocalFs {
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()> {
let target_file_path = self.resolve_in_storage(to)?;
create_target_directory(&target_file_path).await?;
// We need this dance with sort of durable rename (without fsyncs)
// to prevent partial uploads. This was really hit when pageserver shutdown
// cancelled the upload and partial file was left on the fs
let temp_file_path = path_with_suffix_extension(&target_file_path, ".temp");
let mut destination = io::BufWriter::new(
fs::OpenOptions::new()
.write(true)
.create(true)
.open(&target_file_path)
.open(&temp_file_path)
.await
.with_context(|| {
format!(
@@ -101,16 +132,43 @@ impl RemoteStorage for LocalFs {
.await
.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
"Failed to upload file (write temp) to the local storage at '{}'",
temp_file_path.display()
)
})?;
destination.flush().await.with_context(|| {
format!(
"Failed to upload (flush temp) file to the local storage at '{}'",
temp_file_path.display()
)
})?;
fs::rename(temp_file_path, &target_file_path)
.await
.with_context(|| {
format!(
"Failed to upload (rename) file to the local storage at '{}'",
target_file_path.display()
)
})?;
destination.flush().await.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
target_file_path.display()
if let Some(storage_metadata) = metadata {
let storage_metadata_path = storage_metadata_path(&target_file_path);
fs::write(
&storage_metadata_path,
serde_json::to_string(&storage_metadata.0)
.context("Failed to serialize storage metadata as json")?,
)
})?;
.await
.with_context(|| {
format!(
"Failed to write metadata to the local storage at '{}'",
storage_metadata_path.display()
)
})?;
}
Ok(())
}
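The upload path above writes to a `.temp` file first and renames it onto the target only after a successful flush, so a cancelled upload cannot leave a partial file under the final name. A minimal synchronous sketch of the same idea using only `std` (the real code is async and uses `tokio::fs`); `write_via_temp_file` is a hypothetical helper, and, as in the original, no fsync is performed, so this guards against partial writes rather than power loss.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `contents` to `<target>.temp`, then renames it to `target`.
/// Readers of `target` see either the old file or the complete new one.
pub fn write_via_temp_file(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Build the temp path by appending a suffix to the full file name.
    let mut temp_name = target.as_os_str().to_os_string();
    temp_name.push(".temp");
    let temp_path = PathBuf::from(temp_name);

    let mut writer = io::BufWriter::new(fs::File::create(&temp_path)?);
    writer.write_all(contents)?;
    // Flush buffered bytes before the rename, otherwise the renamed file could be short.
    writer.flush()?;
    drop(writer);

    fs::rename(&temp_path, target)
}
```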
@@ -118,7 +176,7 @@ impl RemoteStorage for LocalFs {
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
) -> anyhow::Result<Option<StorageMetadata>> {
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
@@ -141,7 +199,8 @@ impl RemoteStorage for LocalFs {
)
})?;
source.flush().await?;
Ok(())
self.read_storage_metadata(&file_path).await
} else {
bail!(
"File '{}' either does not exist or is not a file",
@@ -156,7 +215,7 @@ impl RemoteStorage for LocalFs {
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
) -> anyhow::Result<Option<StorageMetadata>> {
if let Some(end_exclusive) = end_exclusive {
ensure!(
end_exclusive > start_inclusive,
@@ -165,7 +224,7 @@ impl RemoteStorage for LocalFs {
end_exclusive
);
if start_inclusive == end_exclusive.saturating_sub(1) {
return Ok(());
return Ok(None);
}
}
let file_path = self.resolve_in_storage(from)?;
@@ -199,7 +258,8 @@ impl RemoteStorage for LocalFs {
file_path.display()
)
})?;
Ok(())
self.read_storage_metadata(&file_path).await
} else {
bail!(
"File '{}' either does not exist or is not a file",
@@ -221,6 +281,17 @@ impl RemoteStorage for LocalFs {
}
}
fn path_with_suffix_extension(original_path: &Path, suffix: &str) -> PathBuf {
let mut extension_with_suffix = original_path.extension().unwrap_or_default().to_os_string();
extension_with_suffix.push(suffix);
original_path.with_extension(extension_with_suffix)
}
fn storage_metadata_path(original_path: &Path) -> PathBuf {
path_with_suffix_extension(original_path, ".metadata")
}
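The two helpers above derive the `.temp` and `.metadata` sidecar paths by appending a suffix to the existing extension. The sketch below repeats `path_with_suffix_extension` as it appears in the diff so that it can be exercised standalone; the example paths are made up.

```rust
use std::path::{Path, PathBuf};

/// Same logic as in the diff: append `suffix` to the current extension.
pub fn path_with_suffix_extension(original_path: &Path, suffix: &str) -> PathBuf {
    let mut extension_with_suffix = original_path
        .extension()
        .unwrap_or_default()
        .to_os_string();
    extension_with_suffix.push(suffix);
    original_path.with_extension(extension_with_suffix)
}

/// Sidecar path that stores the key-value metadata for a stored file.
pub fn storage_metadata_path(original_path: &Path) -> PathBuf {
    path_with_suffix_extension(original_path, ".metadata")
}
```

For a path such as `timelines/layer.bin` this yields `timelines/layer.bin.temp` and `timelines/layer.bin.metadata`. For a file name without any extension, `with_extension` inserts its own dot before the suffix, so the result appears to contain a double dot; that is harmless as long as both sides use the same helper, but worth knowing when inspecting the directory.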
fn get_all_files<'a, P>(
directory_path: P,
) -> Pin<Box<dyn Future<Output = anyhow::Result<Vec<PathBuf>>> + Send + Sync + 'a>>
@@ -430,7 +501,7 @@ mod fs_tests {
use super::*;
use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID};
use std::io::Write;
use std::{collections::HashMap, io::Write};
use tempfile::tempdir;
#[tokio::test]
@@ -444,7 +515,7 @@ mod fs_tests {
)
.await?;
let target_path = PathBuf::from("/").join("somewhere").join("else");
match storage.upload(source, &target_path).await {
match storage.upload(source, &target_path, None).await {
Ok(()) => panic!("Should not allow storing files with wrong target path"),
Err(e) => {
let message = format!("{:?}", e);
@@ -454,14 +525,14 @@ mod fs_tests {
}
assert!(storage.list().await?.is_empty());
let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1").await?;
let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1", None).await?;
assert_eq!(
storage.list().await?,
vec![target_path_1.clone()],
"Should list a single file after first upload"
);
let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2").await?;
let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2", None).await?;
assert_eq!(
list_files_sorted(&storage).await?,
vec![target_path_1.clone(), target_path_2.clone()],
@@ -482,12 +553,16 @@ mod fs_tests {
let repo_harness = RepoHarness::create("download_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage.download(&upload_target, &mut content_bytes).await?;
content_bytes.flush().await?;
let metadata = storage.download(&upload_target, &mut content_bytes).await?;
assert!(
metadata.is_none(),
"No metadata should be returned for no metadata upload"
);
content_bytes.flush().await?;
let contents = String::from_utf8(content_bytes.into_inner().into_inner())?;
assert_eq!(
dummy_contents(upload_name),
@@ -512,12 +587,16 @@ mod fs_tests {
let repo_harness = RepoHarness::create("download_file_range_positive")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
let metadata = storage
.download_range(&upload_target, 0, None, &mut full_range_bytes)
.await?;
assert!(
metadata.is_none(),
"No metadata should be returned for no metadata upload"
);
full_range_bytes.flush().await?;
assert_eq!(
dummy_contents(upload_name),
@@ -527,7 +606,7 @@ mod fs_tests {
let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let same_byte = 1_000_000_000;
storage
let metadata = storage
.download_range(
&upload_target,
same_byte,
@@ -535,6 +614,10 @@ mod fs_tests {
&mut zero_range_bytes,
)
.await?;
assert!(
metadata.is_none(),
"No metadata should be returned for no metadata upload"
);
zero_range_bytes.flush().await?;
assert!(
zero_range_bytes.into_inner().into_inner().is_empty(),
@@ -545,7 +628,7 @@ mod fs_tests {
let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
let metadata = storage
.download_range(
&upload_target,
0,
@@ -553,6 +636,11 @@ mod fs_tests {
&mut first_part_remote,
)
.await?;
assert!(
metadata.is_none(),
"No metadata should be returned for no metadata upload"
);
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
assert_eq!(
@@ -562,7 +650,7 @@ mod fs_tests {
);
let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
let metadata = storage
.download_range(
&upload_target,
first_part_local.len() as u64,
@@ -570,6 +658,11 @@ mod fs_tests {
&mut second_part_remote,
)
.await?;
assert!(
metadata.is_none(),
"No metadata should be returned for no metadata upload"
);
second_part_remote.flush().await?;
let second_part_remote = second_part_remote.into_inner().into_inner();
assert_eq!(
@@ -586,7 +679,7 @@ mod fs_tests {
let repo_harness = RepoHarness::create("download_file_range_negative")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let start = 10000;
let end = 234;
@@ -624,7 +717,7 @@ mod fs_tests {
let repo_harness = RepoHarness::create("delete_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
storage.delete(&upload_target).await?;
assert!(storage.list().await?.is_empty());
@@ -640,10 +733,69 @@ mod fs_tests {
Ok(())
}
#[tokio::test]
async fn file_with_metadata() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("file_with_metadata")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let metadata = StorageMetadata(HashMap::from([
("one".to_string(), "1".to_string()),
("two".to_string(), "2".to_string()),
]));
let upload_target =
upload_dummy_file(&repo_harness, &storage, upload_name, Some(metadata.clone())).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let full_download_metadata = storage.download(&upload_target, &mut content_bytes).await?;
content_bytes.flush().await?;
let contents = String::from_utf8(content_bytes.into_inner().into_inner())?;
assert_eq!(
dummy_contents(upload_name),
contents,
"We should upload and download the same contents"
);
assert_eq!(
full_download_metadata.as_ref(),
Some(&metadata),
"We should get the same metadata back for full download"
);
let uploaded_bytes = dummy_contents(upload_name).into_bytes();
let (first_part_local, _) = uploaded_bytes.split_at(3);
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let partial_download_metadata = storage
.download_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
&mut first_part_remote,
)
.await?;
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
assert_eq!(
first_part_local,
first_part_remote.as_slice(),
"First part bytes should be returned when requested"
);
assert_eq!(
partial_download_metadata.as_ref(),
Some(&metadata),
"We should get the same metadata back for partial download"
);
Ok(())
}
async fn upload_dummy_file(
harness: &RepoHarness,
harness: &RepoHarness<'_>,
storage: &LocalFs,
name: &str,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<PathBuf> {
let timeline_path = harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?;
@@ -656,6 +808,7 @@ mod fs_tests {
)
.await?,
&storage_path,
metadata,
)
.await?;
Ok(storage_path)


@@ -1,4 +1,4 @@
//! AWS S3 storage wrapper around `rust_s3` library.
//! AWS S3 storage wrapper around `rusoto` library.
//!
//! Respects `prefix_in_bucket` property from [`S3Config`],
//! allowing multiple pageservers to independently work with the same S3 bucket, if
@@ -7,15 +7,25 @@
use std::path::{Path, PathBuf};
use anyhow::Context;
use s3::{bucket::Bucket, creds::Credentials, region::Region};
use tokio::io::{self, AsyncWriteExt};
use tracing::debug;
use rusoto_core::{
credential::{InstanceMetadataProvider, StaticProvider},
HttpClient, Region,
};
use rusoto_s3::{
DeleteObjectRequest, GetObjectRequest, ListObjectsV2Request, PutObjectRequest, S3Client,
StreamingBody, S3,
};
use tokio::io;
use tokio_util::io::ReaderStream;
use tracing::{debug, trace};
use crate::{
config::S3Config,
remote_storage::{strip_path_prefix, RemoteStorage},
};
use super::StorageMetadata;
const S3_FILE_SEPARATOR: char = '/';
#[derive(Debug, Eq, PartialEq)]
@@ -50,38 +60,50 @@ impl S3ObjectKey {
}
/// AWS S3 storage.
pub struct S3 {
pub struct S3Bucket {
pageserver_workdir: &'static Path,
bucket: Bucket,
client: S3Client,
bucket_name: String,
prefix_in_bucket: Option<String>,
}
impl S3 {
/// Creates the storage, errors if incorrect AWS S3 configuration provided.
impl S3Bucket {
/// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
// TODO kb check this
// Keeping a single client may cause issues due to timeouts.
// https://github.com/rusoto/rusoto/issues/1686
debug!(
"Creating s3 remote storage around bucket {}",
"Creating s3 remote storage for S3 bucket {}",
aws_config.bucket_name
);
let region = match aws_config.endpoint.clone() {
Some(endpoint) => Region::Custom {
endpoint,
region: aws_config.bucket_region.clone(),
Some(custom_endpoint) => Region::Custom {
name: aws_config.bucket_region.clone(),
endpoint: custom_endpoint,
},
None => aws_config
.bucket_region
.parse::<Region>()
.context("Failed to parse the s3 region from config")?,
};
let credentials = Credentials::new(
aws_config.access_key_id.as_deref(),
aws_config.secret_access_key.as_deref(),
None,
None,
None,
)
.context("Failed to create the s3 credentials")?;
let request_dispatcher = HttpClient::new().context("Failed to create S3 http client")?;
let client = if aws_config.access_key_id.is_none() && aws_config.secret_access_key.is_none()
{
trace!("Using IAM-based AWS access");
S3Client::new_with(request_dispatcher, InstanceMetadataProvider::new(), region)
} else {
trace!("Using credentials-based AWS access");
S3Client::new_with(
request_dispatcher,
StaticProvider::new_minimal(
aws_config.access_key_id.clone().unwrap_or_default(),
aws_config.secret_access_key.clone().unwrap_or_default(),
),
region,
)
};
let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| {
let mut prefix = prefix;
@@ -97,20 +119,16 @@ impl S3 {
});
Ok(Self {
bucket: Bucket::new_with_path_style(
aws_config.bucket_name.as_str(),
region,
credentials,
)
.context("Failed to create the s3 bucket")?,
client,
pageserver_workdir,
bucket_name: aws_config.bucket_name.clone(),
prefix_in_bucket,
})
}
}
#[async_trait::async_trait]
impl RemoteStorage for S3 {
impl RemoteStorage for S3Bucket {
type StoragePath = S3ObjectKey;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
@@ -129,74 +147,74 @@ impl RemoteStorage for S3 {
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
let list_response = self
.bucket
.list(self.prefix_in_bucket.clone().unwrap_or_default(), None)
.await
.context("Failed to list s3 objects")?;
let mut document_keys = Vec::new();
Ok(list_response
.into_iter()
.flat_map(|response| response.contents)
.map(|s3_object| S3ObjectKey(s3_object.key))
.collect())
let mut continuation_token = None;
loop {
let fetch_response = self
.client
.list_objects_v2(ListObjectsV2Request {
bucket: self.bucket_name.clone(),
prefix: self.prefix_in_bucket.clone(),
continuation_token,
..ListObjectsV2Request::default()
})
.await?;
document_keys.extend(
fetch_response
.contents
.unwrap_or_default()
.into_iter()
.filter_map(|o| Some(S3ObjectKey(o.key?))),
);
match fetch_response.next_continuation_token {
Some(new_token) => continuation_token = Some(new_token),
None => break,
}
}
Ok(document_keys)
}
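The `list` implementation above pages through results by feeding a token from each response into the next request. In the S3 ListObjectsV2 API the response carries the token for the following page in a field separate from the one that echoes the request's token, so the loop has to read the "next" one. A minimal sketch of the loop against a mock page source; `Page` and `list_all` are hypothetical names and no real S3 client is involved.

```rust
/// One page of a hypothetical listing response.
pub struct Page {
    pub keys: Vec<String>,
    /// Token to send with the next request; `None` once the listing is complete.
    pub next_token: Option<String>,
}

/// Collects keys from all pages, passing each page's `next_token` to the next fetch.
pub fn list_all<F>(mut fetch_page: F) -> Vec<String>
where
    F: FnMut(Option<String>) -> Page,
{
    let mut keys = Vec::new();
    let mut token = None; // the first request carries no token
    loop {
        let page = fetch_page(token);
        keys.extend(page.keys);
        match page.next_token {
            Some(next) => token = Some(next),
            None => break, // no more pages
        }
    }
    keys
}
```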
async fn upload(
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()> {
let mut upload_contents = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(&mut from, &mut upload_contents)
.await
.context("Failed to read the upload contents")?;
upload_contents
.flush()
.await
.context("Failed to read the upload contents")?;
let upload_contents = upload_contents.into_inner().into_inner();
let (_, code) = self
.bucket
.put_object(to.key(), &upload_contents)
.await
.with_context(|| format!("Failed to create s3 object with key {}", to.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during creating object with key '{}', code: {}",
to.key(),
code
))
} else {
Ok(())
}
self.client
.put_object(PutObjectRequest {
body: Some(StreamingBody::new(ReaderStream::new(from))),
bucket: self.bucket_name.clone(),
key: to.key().to_owned(),
metadata: metadata.map(|m| m.0),
..PutObjectRequest::default()
})
.await?;
Ok(())
}
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
let (data, code) = self
.bucket
.get_object(from.key())
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during downloading object, code: {}",
code
))
} else {
// we don't have to write vector into the destination this way, `to_write_all` would be enough.
// but we want to prepare for migration on `rusoto`, that has a streaming HTTP body instead here, with
// which it makes more sense to use `io::copy`.
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded data into the destination buffer")?;
Ok(())
) -> anyhow::Result<Option<StorageMetadata>> {
let object_output = self
.client
.get_object(GetObjectRequest {
bucket: self.bucket_name.clone(),
key: from.key().to_owned(),
..GetObjectRequest::default()
})
.await?;
if let Some(body) = object_output.body {
let mut from = io::BufReader::new(body.into_async_read());
io::copy(&mut from, to).await?;
}
Ok(object_output.metadata.map(StorageMetadata))
}
async fn download_range(
@@ -205,44 +223,41 @@ impl RemoteStorage for S3 {
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
) -> anyhow::Result<Option<StorageMetadata>> {
// S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
// and needs both ends to be inclusive
let end_inclusive = end_exclusive.map(|end| end.saturating_sub(1));
let (data, code) = self
.bucket
.get_object_range(from.key(), start_inclusive, end_inclusive)
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 206 {
Err(anyhow::format_err!(
"Received non-206 exit code during downloading object range, code: {}",
code
))
} else {
// see `download` function above for the comment on why `Vec<u8>` buffer is copied this way
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded range into the destination buffer")?;
Ok(())
let range = Some(match end_inclusive {
Some(end_inclusive) => format!("bytes={}-{}", start_inclusive, end_inclusive),
None => format!("bytes={}-", start_inclusive),
});
let object_output = self
.client
.get_object(GetObjectRequest {
bucket: self.bucket_name.clone(),
key: from.key().to_owned(),
range,
..GetObjectRequest::default()
})
.await?;
if let Some(body) = object_output.body {
let mut from = io::BufReader::new(body.into_async_read());
io::copy(&mut from, to).await?;
}
Ok(object_output.metadata.map(StorageMetadata))
}
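The range download above converts the trait's half-open range (`start_inclusive`, `end_exclusive`) into an HTTP `Range` header value, where both byte positions are inclusive. A minimal sketch of just that conversion; `http_byte_range` is a hypothetical helper name.

```rust
/// Converts a half-open byte range into an HTTP `Range` header value.
/// HTTP byte ranges are inclusive on both ends, so the exclusive end is decremented.
pub fn http_byte_range(start_inclusive: u64, end_exclusive: Option<u64>) -> String {
    match end_exclusive {
        Some(end) => format!("bytes={}-{}", start_inclusive, end.saturating_sub(1)),
        // An open-ended range requests everything from `start_inclusive` to the end.
        None => format!("bytes={}-", start_inclusive),
    }
}
```

For example, the first three bytes of an object (`0..3`) become `bytes=0-2`, and "from byte 10 to the end" becomes `bytes=10-`.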
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
let (_, code) = self
.bucket
.delete_object(path.key())
.await
.with_context(|| format!("Failed to delete s3 object with key {}", path.key()))?;
if code != 204 {
Err(anyhow::format_err!(
"Received non-204 exit code during deleting object with key '{}', code: {}",
path.key(),
code
))
} else {
Ok(())
}
self.client
.delete_object(DeleteObjectRequest {
bucket: self.bucket_name.clone(),
key: path.key().to_owned(),
..DeleteObjectRequest::default()
})
.await?;
Ok(())
}
}
@@ -314,7 +329,7 @@ mod tests {
#[test]
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &S3, mismatching_path: &Path) -> String {
fn storage_path_error(storage: &S3Bucket, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
Ok(wrong_key) => panic!(
"Expected path '{}' to error, but got S3 key: {:?}",
@@ -412,15 +427,11 @@ mod tests {
Ok(())
}
fn dummy_storage(pageserver_workdir: &'static Path) -> S3 {
S3 {
fn dummy_storage(pageserver_workdir: &'static Path) -> S3Bucket {
S3Bucket {
pageserver_workdir,
bucket: Bucket::new(
"dummy-bucket",
"us-east-1".parse().unwrap(),
Credentials::anonymous().unwrap(),
)
.unwrap(),
client: S3Client::new("us-east-1".parse().unwrap()),
bucket_name: "dummy-bucket".to_string(),
prefix_in_bucket: Some("dummy_prefix/".to_string()),
}
}


@@ -25,8 +25,9 @@
//! * all newer local state gets scheduled for upload, such timelines are "local" and fully operational
//! * the rest of the remote timelines are reported to pageserver, but not downloaded before they are actually accessed in pageserver,
//! it may schedule the download on such occasions.
//! Then, the index is shared across pageserver under [`RemoteIndex`] guard to ensure proper synchronization.
//!
//! The synchronization unit is an archive: a set of timeline files (or relishes) and a special metadata file, all compressed into a blob.
//! The synchronization unit is an archive: a set of layer files and a special metadata file, all compressed into a blob.
//! Currently, there's no way to process an archive partially: if the archive processing fails, it has to be restarted from scratch next time.
//! An archive contains a set of files of a certain timeline, added during checkpoint(s) and the timeline metadata at that moment.
//! The archive contains that metadata's `disk_consistent_lsn` in its name, to be able to restore partial index information from just a remote storage file list.
@@ -58,7 +59,7 @@
//! Synchronization never removes any local files from pageserver workdir or remote files from the remote storage, yet there could be overwrites of the same files (metadata file updates; future checksum mismatch fixes).
//! NOTE: No real contents or checksum check happens right now; this is subject to improvement later.
//!
//! After the whole timeline is downloaded, [`crate::tenant_mgr::set_timeline_states`] function is used to update pageserver memory stage for the timeline processed.
//! After the whole timeline is downloaded, [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function is used to update pageserver memory state for the timeline processed.
//!
//! When pageserver signals shutdown, current sync task gets finished and the loop exits.
@@ -80,10 +81,7 @@ use futures::stream::{FuturesUnordered, StreamExt};
use lazy_static::lazy_static;
use tokio::{
runtime::Runtime,
sync::{
mpsc::{self, UnboundedReceiver},
RwLock,
},
sync::mpsc::{self, UnboundedReceiver},
time::{Duration, Instant},
};
use tracing::*;
@@ -92,18 +90,26 @@ use self::{
compression::ArchiveHeader,
download::{download_timeline, DownloadedTimeline},
index::{
ArchiveDescription, ArchiveId, RemoteTimeline, RemoteTimelineIndex, TimelineIndexEntry,
ArchiveDescription, ArchiveId, RemoteIndex, RemoteTimeline, RemoteTimelineIndex,
TimelineIndexEntry, TimelineIndexEntryInner,
},
upload::upload_timeline_checkpoint,
};
use super::{RemoteStorage, SyncStartupData, ZTenantTimelineId};
use super::{
LocalTimelineInitStatus, LocalTimelineInitStatuses, RemoteStorage, SyncStartupData,
ZTenantTimelineId,
};
use crate::{
config::PageServerConf, layered_repository::metadata::TimelineMetadata,
remote_storage::storage_sync::compression::read_archive_header, repository::TimelineSyncState,
tenant_mgr::set_timeline_states, thread_mgr, thread_mgr::ThreadKind,
remote_storage::storage_sync::compression::read_archive_header,
repository::TimelineSyncStatusUpdate, tenant_mgr::apply_timeline_sync_status_updates,
thread_mgr, thread_mgr::ThreadKind,
};
use zenith_metrics::{register_histogram_vec, register_int_gauge, HistogramVec, IntGauge};
use zenith_metrics::{
register_histogram_vec, register_int_counter, register_int_gauge, HistogramVec, IntCounter,
IntGauge,
};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
lazy_static! {
@@ -112,6 +118,11 @@ lazy_static! {
"Number of storage sync items left in the queue"
)
.expect("failed to register pageserver remote storage remaining sync items int gauge");
static ref FATAL_TASK_FAILURES: IntCounter = register_int_counter!(
"pageserver_remote_storage_fatal_task_failures",
"Number of critically failed tasks"
)
.expect("failed to register pageserver remote storage fatal task failures int counter");
static ref IMAGE_SYNC_TIME: HistogramVec = register_histogram_vec!(
"pageserver_remote_storage_image_sync_time",
"Time took to synchronize (download or upload) a whole pageserver image. \
@@ -129,7 +140,7 @@ lazy_static! {
/// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning.
mod sync_queue {
use std::{
collections::{BTreeSet, HashMap},
collections::HashMap,
sync::atomic::{AtomicUsize, Ordering},
};
@@ -192,9 +203,9 @@ mod sync_queue {
pub async fn next_task_batch(
receiver: &mut UnboundedReceiver<SyncTask>,
mut max_batch_size: usize,
) -> BTreeSet<SyncTask> {
) -> Vec<SyncTask> {
if max_batch_size == 0 {
return BTreeSet::new();
return Vec::new();
}
let mut tasks = HashMap::with_capacity(max_batch_size);
@@ -231,7 +242,7 @@ mod sync_queue {
/// A task to run in the async download/upload loop.
/// Limited by the number of retries, after certain threshold the failing task gets evicted and the timeline disabled.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, Clone)]
pub struct SyncTask {
sync_id: ZTenantTimelineId,
retries: u32,
@@ -248,7 +259,7 @@ impl SyncTask {
}
}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, Clone)]
enum SyncKind {
/// A certain amount of images (archive files) to download.
Download(TimelineDownload),
@@ -268,15 +279,15 @@ impl SyncKind {
/// Local timeline files for upload, appeared after the new checkpoint.
/// Current checkpoint design assumes new files are added only, no deletions or amendment happens.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, Clone)]
pub struct NewCheckpoint {
/// Relish file paths in the pageserver workdir, that were added for the corresponding checkpoint.
/// Layer file paths in the pageserver workdir, that were added for the corresponding checkpoint.
layers: Vec<PathBuf>,
metadata: TimelineMetadata,
}
/// Info about the remote image files.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
#[derive(Debug, Clone)]
struct TimelineDownload {
files_to_skip: Arc<BTreeSet<PathBuf>>,
archives_to_skip: BTreeSet<ArchiveId>,
@@ -310,8 +321,8 @@ pub fn schedule_timeline_checkpoint_upload(
tenant_id, timeline_id
)
} else {
warn!(
"Could not send an upload task for tenant {}, timeline {}: the sync queue is not initialized",
debug!(
"Upload task for tenant {}, timeline {} sent",
tenant_id, timeline_id
)
}
@@ -379,35 +390,42 @@ pub(super) fn spawn_storage_sync_thread<
None
}
});
let remote_index = RemoteTimelineIndex::try_parse_descriptions_from_paths(conf, download_paths);
let remote_index = RemoteIndex::try_parse_descriptions_from_paths(conf, download_paths);
let initial_timeline_states = schedule_first_sync_tasks(&remote_index, local_timeline_files);
let local_timeline_init_statuses = schedule_first_sync_tasks(
&mut runtime.block_on(remote_index.write()),
local_timeline_files,
);
let loop_index = remote_index.clone();
thread_mgr::spawn(
ThreadKind::StorageSync,
None,
None,
"Remote storage sync thread",
false,
move || {
storage_sync_loop(
runtime,
conf,
receiver,
remote_index,
loop_index,
storage,
max_concurrent_sync,
max_sync_errors,
)
);
Ok(())
},
)
.context("Failed to spawn remote storage sync thread")?;
Ok(SyncStartupData {
initial_timeline_states,
remote_index,
local_timeline_init_statuses,
})
}
enum LoopStep {
NewStates(HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>>),
SyncStatusUpdates(HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncStatusUpdate>>),
Shutdown,
}
@@ -419,41 +437,48 @@ fn storage_sync_loop<
runtime: Runtime,
conf: &'static PageServerConf,
mut receiver: UnboundedReceiver<SyncTask>,
index: RemoteTimelineIndex,
index: RemoteIndex,
storage: S,
max_concurrent_sync: NonZeroUsize,
max_sync_errors: NonZeroU32,
) -> anyhow::Result<()> {
let remote_assets = Arc::new((storage, RwLock::new(index)));
) {
let remote_assets = Arc::new((storage, index.clone()));
info!("Starting remote storage sync loop");
loop {
let index = index.clone();
let loop_step = runtime.block_on(async {
tokio::select! {
new_timeline_states = loop_step(
step = loop_step(
conf,
&mut receiver,
Arc::clone(&remote_assets),
max_concurrent_sync,
max_sync_errors,
)
.instrument(debug_span!("storage_sync_loop_step")) => LoopStep::NewStates(new_timeline_states),
.instrument(info_span!("storage_sync_loop_step")) => step,
_ = thread_mgr::shutdown_watcher() => LoopStep::Shutdown,
}
});
match loop_step {
LoopStep::NewStates(new_timeline_states) => {
// Batch timeline download registration to ensure that the external registration code won't block any running tasks before.
set_timeline_states(conf, new_timeline_states);
debug!("Sync loop step completed");
LoopStep::SyncStatusUpdates(new_timeline_states) => {
if new_timeline_states.is_empty() {
debug!("Sync loop step completed, no new timeline states");
} else {
info!(
"Sync loop step completed, {} new timeline state update(s)",
new_timeline_states.len()
);
// Batch timeline download registration to ensure that the external registration code won't block any running tasks before.
apply_timeline_sync_status_updates(conf, index, new_timeline_states);
}
}
LoopStep::Shutdown => {
debug!("Shutdown requested, stopping");
info!("Shutdown requested, stopping");
break;
}
}
}
Ok(())
}
async fn loop_step<
@@ -462,19 +487,18 @@ async fn loop_step<
>(
conf: &'static PageServerConf,
receiver: &mut UnboundedReceiver<SyncTask>,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
max_concurrent_sync: NonZeroUsize,
max_sync_errors: NonZeroU32,
) -> HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>> {
) -> LoopStep {
let max_concurrent_sync = max_concurrent_sync.get();
let mut next_tasks = BTreeSet::new();
let mut next_tasks = Vec::new();
// request the first task in blocking fashion to do less meaningless work
if let Some(first_task) = sync_queue::next_task(receiver).await {
next_tasks.insert(first_task);
next_tasks.push(first_task);
} else {
debug!("Shutdown requested, stopping");
return HashMap::new();
return LoopStep::Shutdown;
};
next_tasks.extend(
sync_queue::next_task_batch(receiver, max_concurrent_sync - 1)
@@ -483,12 +507,17 @@ async fn loop_step<
);
let remaining_queue_length = sync_queue::len();
debug!(
"Processing {} tasks in batch, more tasks left to process: {}",
next_tasks.len(),
remaining_queue_length
);
REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64);
if remaining_queue_length > 0 || !next_tasks.is_empty() {
info!(
"Processing {} tasks in batch, more tasks left to process: {}",
next_tasks.len(),
remaining_queue_length
);
} else {
debug!("No tasks to process");
return LoopStep::SyncStatusUpdates(HashMap::new());
}
let mut task_batch = next_tasks
.into_iter()
@@ -498,8 +527,9 @@ async fn loop_step<
let sync_name = task.kind.sync_name();
let extra_step = match tokio::spawn(
process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors)
.instrument(debug_span!("", sync_id = %sync_id, attempt, sync_name)),
process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors).instrument(
info_span!("process_sync_task", sync_id = %sync_id, attempt, sync_name),
),
)
.await
{
@@ -516,8 +546,10 @@ async fn loop_step<
})
.collect::<FuturesUnordered<_>>();
let mut new_timeline_states: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>> =
HashMap::with_capacity(max_concurrent_sync);
let mut new_timeline_states: HashMap<
ZTenantId,
HashMap<ZTimelineId, TimelineSyncStatusUpdate>,
> = HashMap::with_capacity(max_concurrent_sync);
while let Some((sync_id, state_update)) = task_batch.next().await {
debug!("Finished storage sync task for sync id {}", sync_id);
if let Some(state_update) = state_update {
@@ -532,7 +564,7 @@ async fn loop_step<
}
}
new_timeline_states
LoopStep::SyncStatusUpdates(new_timeline_states)
}
async fn process_task<
@@ -540,24 +572,19 @@ async fn process_task<
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
conf: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
task: SyncTask,
max_sync_errors: NonZeroU32,
) -> Option<TimelineSyncState> {
) -> Option<TimelineSyncStatusUpdate> {
if task.retries > max_sync_errors.get() {
error!(
"Evicting task {:?} that failed {} times, exceeding the error threshold",
task.kind, task.retries
);
return Some(TimelineSyncState::Evicted(
remote_assets
.as_ref()
.1
.read()
.await
.timeline_entry(&task.sync_id)
.and_then(TimelineIndexEntry::disk_consistent_lsn),
));
FATAL_TASK_FAILURES.inc();
// FIXME (rodionov) this can potentially leave holes in timeline uploads
// planned to be fixed as part of https://github.com/zenithdb/zenith/issues/977
return None;
}
if task.retries > 0 {
@@ -569,13 +596,15 @@ async fn process_task<
tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await;
}
let remote_index = &remote_assets.1;
let sync_start = Instant::now();
let sync_name = task.kind.sync_name();
match task.kind {
SyncKind::Download(download_data) => {
let download_result = download_timeline(
conf,
remote_assets,
remote_assets.clone(),
task.sync_id,
download_data,
task.retries + 1,
@@ -585,19 +614,25 @@ async fn process_task<
match download_result {
DownloadedTimeline::Abort => {
register_sync_status(sync_start, sync_name, None);
remote_index
.write()
.await
.set_awaits_download(&task.sync_id, false)
.expect("timeline should be present in remote index");
None
}
DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
} => {
DownloadedTimeline::FailedAndRescheduled => {
register_sync_status(sync_start, sync_name, Some(false));
Some(TimelineSyncState::AwaitsDownload(disk_consistent_lsn))
None
}
DownloadedTimeline::Successful {
disk_consistent_lsn,
} => {
DownloadedTimeline::Successful => {
register_sync_status(sync_start, sync_name, Some(true));
Some(TimelineSyncState::Ready(disk_consistent_lsn))
remote_index
.write()
.await
.set_awaits_download(&task.sync_id, false)
.expect("timeline should be present in remote index");
Some(TimelineSyncStatusUpdate::Downloaded)
}
}
}
@@ -617,45 +652,45 @@ async fn process_task<
}
fn schedule_first_sync_tasks(
index: &RemoteTimelineIndex,
index: &mut RemoteTimelineIndex,
local_timeline_files: HashMap<ZTenantTimelineId, (TimelineMetadata, Vec<PathBuf>)>,
) -> HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>> {
let mut initial_timeline_statuses: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>> =
HashMap::new();
) -> LocalTimelineInitStatuses {
let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new();
let mut new_sync_tasks =
VecDeque::with_capacity(local_timeline_files.len());
for (sync_id, (local_metadata, local_files)) in local_timeline_files {
let local_disk_consistent_lsn = local_metadata.disk_consistent_lsn();
let ZTenantTimelineId {
tenant_id,
timeline_id,
} = sync_id;
match index.timeline_entry(&sync_id) {
match index.timeline_entry_mut(&sync_id) {
Some(index_entry) => {
let timeline_status = compare_local_and_remote_timeline(
let (timeline_status, awaits_download) = compare_local_and_remote_timeline(
&mut new_sync_tasks,
sync_id,
local_metadata,
local_files,
index_entry,
);
match timeline_status {
Some(timeline_status) => {
initial_timeline_statuses
.entry(tenant_id)
.or_default()
.insert(timeline_id, timeline_status);
}
None => error!(
"Failed to compare local and remote timeline for task {}",
sync_id
),
let was_there = local_timeline_init_statuses
.entry(tenant_id)
.or_default()
.insert(timeline_id, timeline_status);
if was_there.is_some() {
// defensive check
warn!(
"Overwriting timeline init sync status. Status {:?} Timeline {}",
timeline_status, timeline_id
);
}
index_entry.set_awaits_download(awaits_download);
}
None => {
// TODO (rodionov) does this mean that we've crashed during tenant creation?
// is it safe to upload this checkpoint? could it be half broken?
new_sync_tasks.push_back(SyncTask::new(
sync_id,
0,
@@ -664,56 +699,18 @@ fn schedule_first_sync_tasks(
metadata: local_metadata,
}),
));
initial_timeline_statuses
local_timeline_init_statuses
.entry(tenant_id)
.or_default()
.insert(
timeline_id,
TimelineSyncState::Ready(local_disk_consistent_lsn),
);
.insert(timeline_id, LocalTimelineInitStatus::LocallyComplete);
}
}
}
let unprocessed_remote_ids = |remote_id: &ZTenantTimelineId| {
initial_timeline_statuses
.get(&remote_id.tenant_id)
.and_then(|timelines| timelines.get(&remote_id.timeline_id))
.is_none()
};
for unprocessed_remote_id in index
.all_sync_ids()
.filter(unprocessed_remote_ids)
.collect::<Vec<_>>()
{
let ZTenantTimelineId {
tenant_id: cloud_only_tenant_id,
timeline_id: cloud_only_timeline_id,
} = unprocessed_remote_id;
match index
.timeline_entry(&unprocessed_remote_id)
.and_then(TimelineIndexEntry::disk_consistent_lsn)
{
Some(remote_disk_consistent_lsn) => {
initial_timeline_statuses
.entry(cloud_only_tenant_id)
.or_default()
.insert(
cloud_only_timeline_id,
TimelineSyncState::CloudOnly(remote_disk_consistent_lsn),
);
}
None => error!(
"Failed to find disk consistent LSN for remote timeline {}",
unprocessed_remote_id
),
}
}
new_sync_tasks.into_iter().for_each(|task| {
sync_queue::push(task);
});
initial_timeline_statuses
local_timeline_init_statuses
}
fn compare_local_and_remote_timeline(
@@ -722,10 +719,21 @@ fn compare_local_and_remote_timeline(
local_metadata: TimelineMetadata,
local_files: Vec<PathBuf>,
remote_entry: &TimelineIndexEntry,
) -> Option<TimelineSyncState> {
) -> (LocalTimelineInitStatus, bool) {
let local_lsn = local_metadata.disk_consistent_lsn();
let uploads = remote_entry.uploaded_checkpoints();
let mut initial_timeline_status = LocalTimelineInitStatus::LocallyComplete;
let mut awaits_download = false;
// TODO probably here we need more sophisticated logic,
// if more data is available remotely can we just download what's there?
// without trying to upload something. It may be tricky, needs further investigation.
// For now looks strange that we can request upload
// and download for the same timeline simultaneously.
// (upload needs to be only for previously unsynced files, not whole timeline dir).
// If one of the tasks fails they will be reordered in the queue which can lead
// to timeline being stuck in evicted state
if !uploads.contains(&local_lsn) {
new_sync_tasks.push_back(SyncTask::new(
sync_id,
@@ -735,6 +743,7 @@ fn compare_local_and_remote_timeline(
metadata: local_metadata,
}),
));
// Note that status here doesn't change.
}
let uploads_count = uploads.len();
@@ -743,7 +752,7 @@ fn compare_local_and_remote_timeline(
.filter(|upload_lsn| upload_lsn <= &local_lsn)
.map(ArchiveId)
.collect();
Some(if archives_to_skip.len() != uploads_count {
if archives_to_skip.len() != uploads_count {
new_sync_tasks.push_back(SyncTask::new(
sync_id,
0,
@@ -752,10 +761,12 @@ fn compare_local_and_remote_timeline(
archives_to_skip,
}),
));
TimelineSyncState::AwaitsDownload(remote_entry.disk_consistent_lsn()?)
} else {
TimelineSyncState::Ready(remote_entry.disk_consistent_lsn().unwrap_or(local_lsn))
})
initial_timeline_status = LocalTimelineInitStatus::NeedsSync;
awaits_download = true;
// we do not need to manipulate the remote consistent lsn here
// because it will be updated when the sync is completed
}
(initial_timeline_status, awaits_download)
}
fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Option<bool>) {
@@ -769,21 +780,23 @@ fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Optio
.observe(secs_elapsed)
}
async fn update_index_description<
async fn fetch_full_index<
P: Send + Sync + 'static,
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
(storage, index): &(S, RwLock<RemoteTimelineIndex>),
(storage, index): &(S, RemoteIndex),
timeline_dir: &Path,
id: ZTenantTimelineId,
) -> anyhow::Result<RemoteTimeline> {
let mut index_write = index.write().await;
let full_index = match index_write.timeline_entry(&id) {
let index_read = index.read().await;
let full_index = match index_read.timeline_entry(&id).map(|e| e.inner()) {
None => bail!("Timeline not found for sync id {}", id),
Some(TimelineIndexEntry::Full(_)) => bail!("Index is already populated for sync id {}", id),
Some(TimelineIndexEntry::Description(description)) => {
Some(TimelineIndexEntryInner::Full(_)) => {
bail!("Index is already populated for sync id {}", id)
}
Some(TimelineIndexEntryInner::Description(description)) => {
let mut archive_header_downloads = FuturesUnordered::new();
for (&archive_id, description) in description {
for (archive_id, description) in description {
archive_header_downloads.push(async move {
let header = download_archive_header(storage, timeline_dir, description)
.await
@@ -795,18 +808,23 @@ async fn update_index_description<
let mut full_index = RemoteTimeline::empty();
while let Some(header_data) = archive_header_downloads.next().await {
match header_data {
Ok((archive_id, header_size, header)) => full_index.update_archive_contents(archive_id.0, header, header_size),
Err((e, archive_id)) => bail!(
"Failed to download archive header for tenant {}, timeline {}, archive for Lsn {}: {}",
id.tenant_id, id.timeline_id, archive_id.0,
e
),
}
Ok((archive_id, header_size, header)) => full_index.update_archive_contents(archive_id.0, header, header_size),
Err((e, archive_id)) => bail!(
"Failed to download archive header for tenant {}, timeline {}, archive for Lsn {}: {}",
id.tenant_id, id.timeline_id, archive_id.0,
e
),
}
}
full_index
}
};
index_write.add_timeline_entry(id, TimelineIndexEntry::Full(full_index.clone()));
drop(index_read); // tokio rw lock is not upgradeable
index
.write()
.await
.upgrade_timeline_entry(&id, full_index.clone())
.context("cannot upgrade timeline entry in remote index")?;
Ok(full_index)
}
@@ -849,8 +867,8 @@ mod test_utils {
#[track_caller]
pub async fn ensure_correct_timeline_upload(
harness: &RepoHarness,
remote_assets: Arc<(LocalFs, RwLock<RemoteTimelineIndex>)>,
harness: &RepoHarness<'_>,
remote_assets: Arc<(LocalFs, RemoteIndex)>,
timeline_id: ZTimelineId,
new_upload: NewCheckpoint,
) {
@@ -867,7 +885,7 @@ mod test_utils {
let (storage, index) = remote_assets.as_ref();
assert_index_descriptions(
index,
RemoteTimelineIndex::try_parse_descriptions_from_paths(
&RemoteIndex::try_parse_descriptions_from_paths(
harness.conf,
remote_assets
.0
@@ -909,11 +927,14 @@ mod test_utils {
}
pub async fn expect_timeline(
index: &RwLock<RemoteTimelineIndex>,
index: &RemoteIndex,
sync_id: ZTenantTimelineId,
) -> RemoteTimeline {
if let Some(TimelineIndexEntry::Full(remote_timeline)) =
index.read().await.timeline_entry(&sync_id)
if let Some(TimelineIndexEntryInner::Full(remote_timeline)) = index
.read()
.await
.timeline_entry(&sync_id)
.map(|e| e.inner())
{
remote_timeline.clone()
} else {
@@ -926,9 +947,11 @@ mod test_utils {
#[track_caller]
pub async fn assert_index_descriptions(
index: &RwLock<RemoteTimelineIndex>,
expected_index_with_descriptions: RemoteTimelineIndex,
index: &RemoteIndex,
expected_index_with_descriptions: &RemoteIndex,
) {
let expected_index_with_descriptions = expected_index_with_descriptions.read().await;
let index_read = index.read().await;
let actual_sync_ids = index_read.all_sync_ids().collect::<BTreeSet<_>>();
let expected_sync_ids = expected_index_with_descriptions
@@ -965,26 +988,26 @@ mod test_utils {
sync_id
)
});
let expected_timeline_description = match expected_timeline_description {
TimelineIndexEntry::Description(description) => description,
TimelineIndexEntry::Full(_) => panic!("Expected index entry for sync id {} is a full entry, while a description was expected", sync_id),
let expected_timeline_description = match expected_timeline_description.inner() {
TimelineIndexEntryInner::Description(description) => description,
TimelineIndexEntryInner::Full(_) => panic!("Expected index entry for sync id {} is a full entry, while a description was expected", sync_id),
};
match actual_timeline_entry {
TimelineIndexEntry::Description(actual_descriptions) => {
match actual_timeline_entry.inner() {
TimelineIndexEntryInner::Description(description) => {
assert_eq!(
actual_descriptions, expected_timeline_description,
description, expected_timeline_description,
"Index contains unexpected descriptions entry for sync id {}",
sync_id
)
}
TimelineIndexEntry::Full(actual_full_entry) => {
TimelineIndexEntryInner::Full(remote_timeline) => {
let expected_lsns = expected_timeline_description
.values()
.map(|description| description.disk_consistent_lsn)
.collect::<BTreeSet<_>>();
assert_eq!(
actual_full_entry.checkpoints().collect::<BTreeSet<_>>(),
remote_timeline.checkpoints().collect::<BTreeSet<_>>(),
expected_lsns,
"Timeline {} should have the same checkpoints uploaded",
sync_id,
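
Editor's sketch, not part of the diff: `loop_step` above waits for the first task in a blocking fashion and then tops the batch up to `max_concurrent_sync` without waiting. A minimal illustration of that shape follows. It assumes `std::sync::mpsc` in place of tokio's unbounded channel, and `next_batch` is a hypothetical name; the real `sync_queue` code may differ (for example, it collects tasks through a `HashMap`).

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Collects up to `max_batch_size` items from the queue.
/// Blocks for the first item, then drains whatever is already queued without blocking.
/// Returns `None` when the channel is closed and empty, which a caller can treat as shutdown.
pub fn next_batch<T>(receiver: &Receiver<T>, max_batch_size: usize) -> Option<Vec<T>> {
    if max_batch_size == 0 {
        return Some(Vec::new());
    }

    // Block until at least one task is available, to avoid spinning on an empty queue.
    let first = receiver.recv().ok()?;
    let mut batch = Vec::with_capacity(max_batch_size);
    batch.push(first);

    // Top up the batch with tasks that are already waiting; never block here.
    while batch.len() < max_batch_size {
        match receiver.try_recv() {
            Ok(task) => batch.push(task),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Returning a `Vec` rather than an ordered set keeps the tasks free of `Ord` requirements, which matches the direction of the derive changes in this file.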

View File

@@ -10,7 +10,7 @@
//! Archiving is almost agnostic to timeline file types, with an exception of the metadata file, that's currently distinguished in the [un]compression code.
//! The metadata file is treated separately when [de]compression is involved, to reduce the risk of corrupting the metadata file.
//! When compressed, the metadata file is always required and stored as the last file in the archive stream.
//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other relishes are decompressed successfully first.
//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other layer files are decompressed successfully first.
//!
//! Archive structure:
//! +----------------------------------------+
@@ -201,8 +201,7 @@ pub async fn read_archive_header<A: io::AsyncRead + Send + Sync + Unpin>(
.await
.context("Failed to decompress a header from the archive")?;
Ok(ArchiveHeader::des(&header_bytes)
.context("Failed to deserialize a header from the archive")?)
ArchiveHeader::des(&header_bytes).context("Failed to deserialize a header from the archive")
}
/// Reads the archive metadata out of the archive name:

View File

@@ -1,18 +1,18 @@
//! Timeline synchronization logic to put files from archives on remote storage into pageserver's local directory.
use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc};
use std::{collections::BTreeSet, path::PathBuf, sync::Arc};
use anyhow::{ensure, Context};
use tokio::{fs, sync::RwLock};
use tokio::fs;
use tracing::{debug, error, trace, warn};
use zenith_utils::{lsn::Lsn, zid::ZTenantId};
use zenith_utils::zid::ZTenantId;
use crate::{
config::PageServerConf,
layered_repository::metadata::{metadata_path, TimelineMetadata},
remote_storage::{
storage_sync::{
compression, index::TimelineIndexEntry, sync_queue, update_index_description, SyncKind,
compression, fetch_full_index, index::TimelineIndexEntryInner, sync_queue, SyncKind,
SyncTask,
},
RemoteStorage, ZTenantTimelineId,
@@ -20,8 +20,8 @@ use crate::{
};
use super::{
index::{ArchiveId, RemoteTimeline, RemoteTimelineIndex},
TimelineDownload,
index::{ArchiveId, RemoteTimeline},
RemoteIndex, TimelineDownload,
};
/// Timeline download result, with extra data, needed for downloading.
@@ -30,10 +30,10 @@ pub(super) enum DownloadedTimeline {
Abort,
/// Remote timeline data is found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known.
/// Initial download failed due to some error, the download task is rescheduled for another retry.
FailedAndRescheduled { disk_consistent_lsn: Lsn },
FailedAndRescheduled,
/// Remote timeline data is found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known.
/// Initial download successful.
Successful { disk_consistent_lsn: Lsn },
Successful,
}
/// Attempts to download and uncompress files from all remote archives for the timeline given.
@@ -47,7 +47,7 @@ pub(super) async fn download_timeline<
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
conf: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
sync_id: ZTenantTimelineId,
mut download: TimelineDownload,
retries: u32,
@@ -58,38 +58,49 @@ pub(super) async fn download_timeline<
tenant_id,
timeline_id,
} = sync_id;
let index_read = remote_assets.1.read().await;
let index = &remote_assets.1;
let index_read = index.read().await;
let remote_timeline = match index_read.timeline_entry(&sync_id) {
None => {
error!("Cannot download: no timeline is present in the index for given ids");
error!("Cannot download: no timeline is present in the index for given id");
drop(index_read);
return DownloadedTimeline::Abort;
}
Some(index_entry) => match index_entry {
TimelineIndexEntry::Full(remote_timeline) => Cow::Borrowed(remote_timeline),
TimelineIndexEntry::Description(_) => {
Some(index_entry) => match index_entry.inner() {
TimelineIndexEntryInner::Full(remote_timeline) => {
let cloned = remote_timeline.clone();
drop(index_read);
cloned
}
TimelineIndexEntryInner::Description(_) => {
// we do not check here for awaits_download because it is ok
// to call this function while the download is in progress
// so it is not a concurrent download, it is the same one
let remote_disk_consistent_lsn = index_entry.disk_consistent_lsn();
drop(index_read);
debug!("Found timeline description for the given ids, downloading the full index");
match update_index_description(
match fetch_full_index(
remote_assets.as_ref(),
&conf.timeline_path(&timeline_id, &tenant_id),
sync_id,
)
.await
{
Ok(remote_timeline) => Cow::Owned(remote_timeline),
Ok(remote_timeline) => remote_timeline,
Err(e) => {
error!("Failed to download full timeline index: {:?}", e);
return match remote_disk_consistent_lsn {
Some(disk_consistent_lsn) => {
Some(_) => {
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Download(download),
));
DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
}
DownloadedTimeline::FailedAndRescheduled
}
None => {
error!("Cannot download: no disk consistent Lsn is present for the index entry");
@@ -101,12 +112,9 @@ pub(super) async fn download_timeline<
}
},
};
let disk_consistent_lsn = match remote_timeline.checkpoints().max() {
Some(lsn) => lsn,
None => {
debug!("Cannot download: no disk consistent Lsn is present for the remote timeline");
return DownloadedTimeline::Abort;
}
if remote_timeline.checkpoints().max().is_none() {
debug!("Cannot download: no disk consistent Lsn is present for the remote timeline");
return DownloadedTimeline::Abort;
};
debug!("Downloading timeline archives");
@@ -125,7 +133,7 @@ pub(super) async fn download_timeline<
conf,
sync_id,
Arc::clone(&remote_assets),
remote_timeline.as_ref(),
&remote_timeline,
archive_id,
Arc::clone(&download.files_to_skip),
)
@@ -142,9 +150,7 @@ pub(super) async fn download_timeline<
retries,
SyncKind::Download(download),
));
return DownloadedTimeline::FailedAndRescheduled {
disk_consistent_lsn,
};
return DownloadedTimeline::FailedAndRescheduled;
}
Ok(()) => {
debug!("Successfully downloaded archive {:?}", archive_id);
@@ -154,9 +160,7 @@ pub(super) async fn download_timeline<
}
debug!("Finished downloading all timeline's archives");
DownloadedTimeline::Successful {
disk_consistent_lsn,
}
DownloadedTimeline::Successful
}
async fn try_download_archive<
@@ -168,7 +172,7 @@ async fn try_download_archive<
tenant_id,
timeline_id,
}: ZTenantTimelineId,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
remote_timeline: &RemoteTimeline,
archive_id: ArchiveId,
files_to_skip: Arc<BTreeSet<PathBuf>>,
@@ -226,8 +230,8 @@ async fn read_local_metadata(
let local_metadata_bytes = fs::read(&local_metadata_path)
.await
.context("Failed to read local metadata file bytes")?;
Ok(TimelineMetadata::from_bytes(&local_metadata_bytes)
.context("Failed to read local metadata files bytes")?)
TimelineMetadata::from_bytes(&local_metadata_bytes)
.context("Failed to read local metadata files bytes")
}
#[cfg(test)]
@@ -256,14 +260,14 @@ mod tests {
let repo_harness = RepoHarness::create("test_download_timeline")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
let index = RemoteIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
);
let remote_assets = Arc::new((storage, index));
let storage = &remote_assets.0;
let index = &remote_assets.1;
@@ -313,7 +317,7 @@ mod tests {
.await;
assert_index_descriptions(
index,
RemoteTimelineIndex::try_parse_descriptions_from_paths(
&RemoteIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
remote_assets
.0
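
Editor's sketch, not part of the diff: both `download_timeline` and `fetch_full_index` now copy what they need out of the read guard, drop the guard, and only then take the write lock, because the lock cannot be upgraded in place. The example below illustrates that pattern under stated assumptions: it uses `std::sync::RwLock` instead of tokio's async lock, and `SharedIndex`, `Entry`, and `upgrade` are hypothetical stand-ins for `RemoteIndex`, `TimelineIndexEntryInner`, and `upgrade_timeline_entry`. The re-check under the write lock is an assumption about what a careful upgrade would do, not a statement about the real implementation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for an index entry: a cheap description or a full listing.
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Description(Vec<u64>),
    Full(Vec<u64>),
}

/// Cheaply cloneable handle to a shared map; every clone points at the same data.
#[derive(Clone, Default)]
pub struct SharedIndex(Arc<RwLock<HashMap<u32, Entry>>>);

impl SharedIndex {
    pub fn insert(&self, id: u32, entry: Entry) {
        self.0.write().unwrap().insert(id, entry);
    }

    pub fn get(&self, id: u32) -> Option<Entry> {
        self.0.read().unwrap().get(&id).cloned()
    }

    /// Turns a `Description` entry into a `Full` one without holding a lock during the slow part.
    pub fn upgrade(&self, id: u32) -> Result<Vec<u64>, String> {
        // Phase 1: hold the read lock only long enough to copy the data out.
        let guard = self.0.read().unwrap();
        let mut lsns = match guard.get(&id) {
            None => return Err(format!("entry {} not found", id)),
            Some(Entry::Full(_)) => return Err(format!("entry {} is already full", id)),
            Some(Entry::Description(lsns)) => lsns.clone(),
        };
        drop(guard); // the lock is not upgradeable, so release it explicitly

        // Slow work (header downloads in the real code) would run here, lock-free.
        lsns.sort_unstable();

        // Phase 2: take the write lock and re-check, since the entry may have changed meanwhile.
        let mut guard = self.0.write().unwrap();
        if !matches!(guard.get(&id), Some(Entry::Description(_))) {
            return Err(format!("entry {} changed while unlocked", id));
        }
        guard.insert(id, Entry::Full(lsns.clone()));
        Ok(lsns)
    }
}
```

Cloning the entry costs an allocation, but it keeps readers from blocking writers for the duration of a download, which appears to be the trade-off the diff makes when it replaces `Cow::Borrowed` with an owned `RemoteTimeline`.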

View File

@@ -7,11 +7,13 @@
use std::{
collections::{BTreeMap, BTreeSet, HashMap},
path::{Path, PathBuf},
sync::Arc,
};
use anyhow::{bail, ensure, Context};
use serde::{Deserialize, Serialize};
use tracing::debug;
use tokio::sync::RwLock;
use tracing::*;
use zenith_utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
@@ -52,10 +54,19 @@ impl RelativePath {
/// Currently, timeline archive files are tracked only.
#[derive(Debug, Clone)]
pub struct RemoteTimelineIndex {
timeline_files: HashMap<ZTenantTimelineId, TimelineIndexEntry>,
timeline_entries: HashMap<ZTenantTimelineId, TimelineIndexEntry>,
}
impl RemoteTimelineIndex {
/// A wrapper to synchronize access to the index; it should be created and used before dealing with any [`RemoteTimelineIndex`].
pub struct RemoteIndex(Arc<RwLock<RemoteTimelineIndex>>);
impl RemoteIndex {
pub fn empty() -> Self {
Self(Arc::new(RwLock::new(RemoteTimelineIndex {
timeline_entries: HashMap::new(),
})))
}
/// Attempts to parse file paths (not checking the file contents) and find files
/// that can be tracked with the index.
/// On parse failures, logs the error and continues, so an empty index can be created from unsuitable paths.
@@ -63,8 +74,8 @@ impl RemoteTimelineIndex {
conf: &'static PageServerConf,
paths: impl Iterator<Item = P>,
) -> Self {
let mut index = Self {
timeline_files: HashMap::new(),
let mut index = RemoteTimelineIndex {
timeline_entries: HashMap::new(),
};
for path in paths {
if let Err(e) = try_parse_index_entry(&mut index, conf, path.as_ref()) {
@@ -75,44 +86,121 @@ impl RemoteTimelineIndex {
);
}
}
index
Self(Arc::new(RwLock::new(index)))
}
pub async fn read(&self) -> tokio::sync::RwLockReadGuard<'_, RemoteTimelineIndex> {
self.0.read().await
}
pub async fn write(&self) -> tokio::sync::RwLockWriteGuard<'_, RemoteTimelineIndex> {
self.0.write().await
}
}
impl Clone for RemoteIndex {
fn clone(&self) -> Self {
Self(Arc::clone(&self.0))
}
}
impl RemoteTimelineIndex {
pub fn timeline_entry(&self, id: &ZTenantTimelineId) -> Option<&TimelineIndexEntry> {
self.timeline_files.get(id)
self.timeline_entries.get(id)
}
pub fn timeline_entry_mut(
&mut self,
id: &ZTenantTimelineId,
) -> Option<&mut TimelineIndexEntry> {
self.timeline_files.get_mut(id)
self.timeline_entries.get_mut(id)
}
pub fn add_timeline_entry(&mut self, id: ZTenantTimelineId, entry: TimelineIndexEntry) {
self.timeline_files.insert(id, entry);
self.timeline_entries.insert(id, entry);
}
pub fn upgrade_timeline_entry(
&mut self,
id: &ZTenantTimelineId,
remote_timeline: RemoteTimeline,
) -> anyhow::Result<()> {
let mut entry = self.timeline_entries.get_mut(id).ok_or(anyhow::anyhow!(
"timeline is unexpectedly missing from remote index"
))?;
if !matches!(entry.inner, TimelineIndexEntryInner::Description(_)) {
anyhow::bail!("timeline entry is not a description entry")
};
entry.inner = TimelineIndexEntryInner::Full(remote_timeline);
Ok(())
}
pub fn all_sync_ids(&self) -> impl Iterator<Item = ZTenantTimelineId> + '_ {
self.timeline_files.keys().copied()
self.timeline_entries.keys().copied()
}
pub fn set_awaits_download(
&mut self,
id: &ZTenantTimelineId,
awaits_download: bool,
) -> anyhow::Result<()> {
self.timeline_entry_mut(id)
.ok_or_else(|| anyhow::anyhow!("unknown timeline sync {}", id))?
.set_awaits_download(awaits_download);
Ok(())
}
}
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct DescriptionTimelineIndexEntry {
pub description: BTreeMap<ArchiveId, ArchiveDescription>,
pub awaits_download: bool,
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TimelineIndexEntry {
/// An archive found on the remote storage but not yet downloaded: only the metadata from its storage path is available, without the archive contents.
pub struct FullTimelineIndexEntry {
pub remote_timeline: RemoteTimeline,
pub awaits_download: bool,
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TimelineIndexEntryInner {
Description(BTreeMap<ArchiveId, ArchiveDescription>),
/// Full archive metadata, including the file list, parsed from the archive header.
Full(RemoteTimeline),
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimelineIndexEntry {
inner: TimelineIndexEntryInner,
awaits_download: bool,
}
impl TimelineIndexEntry {
pub fn new(inner: TimelineIndexEntryInner, awaits_download: bool) -> Self {
Self {
inner,
awaits_download,
}
}
pub fn inner(&self) -> &TimelineIndexEntryInner {
&self.inner
}
pub fn inner_mut(&mut self) -> &mut TimelineIndexEntryInner {
&mut self.inner
}
pub fn uploaded_checkpoints(&self) -> BTreeSet<Lsn> {
match self {
Self::Description(description) => {
match &self.inner {
TimelineIndexEntryInner::Description(description) => {
description.keys().map(|archive_id| archive_id.0).collect()
}
Self::Full(remote_timeline) => remote_timeline
TimelineIndexEntryInner::Full(remote_timeline) => remote_timeline
.checkpoint_archives
.keys()
.map(|archive_id| archive_id.0)
@@ -122,17 +210,25 @@ impl TimelineIndexEntry {
/// Gets the latest uploaded checkpoint's disk consistent Lsn for the corresponding timeline.
pub fn disk_consistent_lsn(&self) -> Option<Lsn> {
match self {
Self::Description(description) => {
match &self.inner {
TimelineIndexEntryInner::Description(description) => {
description.keys().map(|archive_id| archive_id.0).max()
}
Self::Full(remote_timeline) => remote_timeline
TimelineIndexEntryInner::Full(remote_timeline) => remote_timeline
.checkpoint_archives
.keys()
.map(|archive_id| archive_id.0)
.max(),
}
}
pub fn get_awaits_download(&self) -> bool {
self.awaits_download
}
pub fn set_awaits_download(&mut self, awaits_download: bool) {
self.awaits_download = awaits_download;
}
}
/// Checkpoint archive's id, corresponding to the `disk_consistent_lsn` from the timeline's metadata file during checkpointing.
@@ -181,7 +277,7 @@ impl RemoteTimeline {
.map(CheckpointArchive::disk_consistent_lsn)
}
/// Lists all relish files in the given remote timeline. Omits the metadata file.
/// Lists all layer files in the given remote timeline. Omits the metadata file.
pub fn stored_files(&self, timeline_dir: &Path) -> BTreeSet<PathBuf> {
self.timeline_files
.values()
@@ -331,13 +427,15 @@ fn try_parse_index_entry(
tenant_id,
timeline_id,
};
let timeline_index_entry = index
.timeline_files
.entry(sync_id)
.or_insert_with(|| TimelineIndexEntry::Description(BTreeMap::new()));
match timeline_index_entry {
TimelineIndexEntry::Description(descriptions) => {
descriptions.insert(
let timeline_index_entry = index.timeline_entries.entry(sync_id).or_insert_with(|| {
TimelineIndexEntry::new(
TimelineIndexEntryInner::Description(BTreeMap::default()),
false,
)
});
match timeline_index_entry.inner_mut() {
TimelineIndexEntryInner::Description(description) => {
description.insert(
ArchiveId(disk_consistent_lsn),
ArchiveDescription {
header_size,
@@ -346,7 +444,7 @@ fn try_parse_index_entry(
},
);
}
TimelineIndexEntry::Full(_) => {
TimelineIndexEntryInner::Full(_) => {
bail!("Cannot add parsed archive description to its full context in index with sync id {}", sync_id)
}
}
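The `RemoteIndex` wrapper introduced in this file can be illustrated with a minimal, std-only sketch. It assumes `std::sync::RwLock` in place of the `tokio::sync::RwLock` used in the diff, and the names `TimelineIndex` and `SharedIndex` and the `u64` key are hypothetical simplifications, not the pageserver types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Simplified stand-in for `RemoteTimelineIndex` (hypothetical):
/// maps a timeline id to its `awaits_download` flag.
#[derive(Debug, Default)]
pub struct TimelineIndex {
    entries: HashMap<u64, bool>,
}

impl TimelineIndex {
    pub fn add_entry(&mut self, id: u64, awaits_download: bool) {
        self.entries.insert(id, awaits_download);
    }

    pub fn awaits_download(&self, id: u64) -> Option<bool> {
        self.entries.get(&id).copied()
    }

    /// Fails for unknown ids, similar in spirit to `set_awaits_download` in the diff.
    pub fn set_awaits_download(&mut self, id: u64, awaits_download: bool) -> Result<(), String> {
        match self.entries.get_mut(&id) {
            Some(flag) => {
                *flag = awaits_download;
                Ok(())
            }
            None => Err(format!("unknown timeline sync {}", id)),
        }
    }
}

/// Cheap-to-clone handle: every clone points at the same locked index.
#[derive(Clone)]
pub struct SharedIndex(Arc<RwLock<TimelineIndex>>);

impl SharedIndex {
    pub fn empty() -> Self {
        Self(Arc::new(RwLock::new(TimelineIndex::default())))
    }

    /// Panics if the lock is poisoned; acceptable for a sketch.
    pub fn read(&self) -> RwLockReadGuard<'_, TimelineIndex> {
        self.0.read().unwrap()
    }

    pub fn write(&self) -> RwLockWriteGuard<'_, TimelineIndex> {
        self.0.write().unwrap()
    }
}
```

Hiding the `Arc<RwLock<..>>` behind a newtype means callers pass a plain handle around and cannot touch the inner index without taking a guard first, which appears to be the motivation for the change above.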

View File

@@ -1,24 +1,22 @@
//! Timeline synchronization logic to compress and upload to the remote storage all new timeline files from the checkpoints.
use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc};
use std::{collections::BTreeSet, path::PathBuf, sync::Arc};
use anyhow::ensure;
use tokio::sync::RwLock;
use tracing::{debug, error, warn};
use crate::{
config::PageServerConf,
remote_storage::{
storage_sync::{
compression,
index::{RemoteTimeline, TimelineIndexEntry},
sync_queue, update_index_description, SyncKind, SyncTask,
compression, fetch_full_index,
index::{RemoteTimeline, TimelineIndexEntry, TimelineIndexEntryInner},
sync_queue, SyncKind, SyncTask,
},
RemoteStorage, ZTenantTimelineId,
},
};
use super::{compression::ArchiveHeader, index::RemoteTimelineIndex, NewCheckpoint};
use super::{compression::ArchiveHeader, NewCheckpoint, RemoteIndex};
/// Attempts to compress and upload given checkpoint files.
/// No extra checks for overlapping files are made: download takes care of that, ensuring no non-metadata local timeline files are overwritten.
@@ -30,7 +28,7 @@ pub(super) async fn upload_timeline_checkpoint<
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
config: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
sync_id: ZTenantTimelineId,
new_checkpoint: NewCheckpoint,
retries: u32,
@@ -48,23 +46,33 @@ pub(super) async fn upload_timeline_checkpoint<
let index_read = index.read().await;
let remote_timeline = match index_read.timeline_entry(&sync_id) {
None => None,
Some(TimelineIndexEntry::Full(remote_timeline)) => Some(Cow::Borrowed(remote_timeline)),
Some(TimelineIndexEntry::Description(_)) => {
debug!("Found timeline description for the given ids, downloading the full index");
match update_index_description(remote_assets.as_ref(), &timeline_dir, sync_id).await {
Ok(remote_timeline) => Some(Cow::Owned(remote_timeline)),
Err(e) => {
error!("Failed to download full timeline index: {:?}", e);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Upload(new_checkpoint),
));
return Some(false);
None => {
drop(index_read);
None
}
Some(entry) => match entry.inner() {
TimelineIndexEntryInner::Full(remote_timeline) => {
let r = Some(remote_timeline.clone());
drop(index_read);
r
}
TimelineIndexEntryInner::Description(_) => {
drop(index_read);
debug!("Found timeline description for the given ids, downloading the full index");
match fetch_full_index(remote_assets.as_ref(), &timeline_dir, sync_id).await {
Ok(remote_timeline) => Some(remote_timeline),
Err(e) => {
error!("Failed to download full timeline index: {:?}", e);
sync_queue::push(SyncTask::new(
sync_id,
retries,
SyncKind::Upload(new_checkpoint),
));
return Some(false);
}
}
}
}
},
};
let already_contains_upload_lsn = remote_timeline
@@ -82,7 +90,6 @@ pub(super) async fn upload_timeline_checkpoint<
let already_uploaded_files = remote_timeline
.map(|timeline| timeline.stored_files(&timeline_dir))
.unwrap_or_default();
drop(index_read);
match try_upload_checkpoint(
config,
@@ -93,30 +100,48 @@ pub(super) async fn upload_timeline_checkpoint<
)
.await
{
Ok((archive_header, header_size)) => {
Some(Ok((archive_header, header_size))) => {
let mut index_write = index.write().await;
match index_write.timeline_entry_mut(&sync_id) {
Some(TimelineIndexEntry::Full(remote_timeline)) => {
remote_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
}
None | Some(TimelineIndexEntry::Description(_)) => {
match index_write
.timeline_entry_mut(&sync_id)
.map(|e| e.inner_mut())
{
None => {
let mut new_timeline = RemoteTimeline::empty();
new_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
index_write.add_timeline_entry(sync_id, TimelineIndexEntry::Full(new_timeline));
index_write.add_timeline_entry(
sync_id,
TimelineIndexEntry::new(TimelineIndexEntryInner::Full(new_timeline), false),
)
}
Some(TimelineIndexEntryInner::Full(remote_timeline)) => {
remote_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
}
Some(TimelineIndexEntryInner::Description(_)) => {
let mut new_timeline = RemoteTimeline::empty();
new_timeline.update_archive_contents(
new_checkpoint.metadata.disk_consistent_lsn(),
archive_header,
header_size,
);
index_write.add_timeline_entry(
sync_id,
TimelineIndexEntry::new(TimelineIndexEntryInner::Full(new_timeline), false),
)
}
}
debug!("Checkpoint uploaded successfully");
Some(true)
}
Err(e) => {
Some(Err(e)) => {
error!(
"Failed to upload checkpoint: {:?}, requeueing the upload",
e
@@ -128,6 +153,7 @@ pub(super) async fn upload_timeline_checkpoint<
));
Some(false)
}
None => Some(true),
}
}
@@ -136,11 +162,11 @@ async fn try_upload_checkpoint<
S: RemoteStorage<StoragePath = P> + Send + Sync + 'static,
>(
config: &'static PageServerConf,
remote_assets: Arc<(S, RwLock<RemoteTimelineIndex>)>,
remote_assets: Arc<(S, RemoteIndex)>,
sync_id: ZTenantTimelineId,
new_checkpoint: &NewCheckpoint,
files_to_skip: BTreeSet<PathBuf>,
) -> anyhow::Result<(ArchiveHeader, u64)> {
) -> Option<anyhow::Result<(ArchiveHeader, u64)>> {
let ZTenantTimelineId {
tenant_id,
timeline_id,
@@ -152,7 +178,7 @@ async fn try_upload_checkpoint<
.iter()
.filter(|&path_to_upload| {
if files_to_skip.contains(path_to_upload) {
error!(
warn!(
"Skipping file upload '{}', since it was already uploaded",
path_to_upload.display()
);
@@ -162,9 +188,16 @@ async fn try_upload_checkpoint<
}
})
.collect::<Vec<_>>();
ensure!(!files_to_upload.is_empty(), "No files to upload");
compression::archive_files_as_stream(
if files_to_upload.is_empty() {
warn!(
"No files to upload. Upload request was: {:?}, already uploaded files: {:?}",
new_checkpoint.layers, files_to_skip
);
return None;
}
let upload_result = compression::archive_files_as_stream(
&timeline_dir,
files_to_upload.into_iter(),
&new_checkpoint.metadata,
@@ -175,12 +208,15 @@ async fn try_upload_checkpoint<
.upload(
archive_streamer,
&remote_storage.storage_path(&timeline_dir.join(&archive_name))?,
None,
)
.await
},
)
.await
.map(|(header, header_size, _)| (header, header_size))
.map(|(header, header_size, _)| (header, header_size));
Some(upload_result)
}
#[cfg(test)]
@@ -209,14 +245,14 @@ mod tests {
let repo_harness = RepoHarness::create("reupload_timeline")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
let index = RemoteIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
);
let remote_assets = Arc::new((storage, index));
let index = &remote_assets.1;
@@ -405,14 +441,14 @@ mod tests {
let repo_harness = RepoHarness::create("reupload_timeline_rejected")?;
let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?;
let index = RwLock::new(RemoteTimelineIndex::try_parse_descriptions_from_paths(
let index = RemoteIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
storage
.list()
.await?
.into_iter()
.map(|storage_path| storage.local_path(&storage_path).unwrap()),
));
);
let remote_assets = Arc::new((storage, index));
let storage = &remote_assets.0;
let index = &remote_assets.1;
@@ -431,7 +467,7 @@ mod tests {
first_checkpoint,
)
.await;
let after_first_uploads = RemoteTimelineIndex::try_parse_descriptions_from_paths(
let after_first_uploads = RemoteIndex::try_parse_descriptions_from_paths(
repo_harness.conf,
remote_assets
.0
@@ -462,7 +498,7 @@ mod tests {
0,
)
.await;
assert_index_descriptions(index, after_first_uploads.clone()).await;
assert_index_descriptions(index, &after_first_uploads).await;
let checkpoint_with_uploaded_lsn = create_local_timeline(
&repo_harness,
@@ -478,7 +514,7 @@ mod tests {
0,
)
.await;
assert_index_descriptions(index, after_first_uploads.clone()).await;
assert_index_descriptions(index, &after_first_uploads).await;
Ok(())
}
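The change of `try_upload_checkpoint` to return `Option<anyhow::Result<..>>` separates "nothing to upload" from "upload failed". A minimal std-only sketch of that three-way outcome follows; `try_upload`, `upload_done`, the string file names, and the `fail` switch are hypothetical stand-ins, and `String` replaces `anyhow::Error`.

```rust
/// Returns `None` when every file was already uploaded, `Some(Ok(n))` after
/// uploading `n` files, and `Some(Err(..))` when the (simulated) upload fails.
pub fn try_upload(
    files: &[&str],
    already_uploaded: &[&str],
    fail: bool,
) -> Option<Result<usize, String>> {
    let to_upload: Vec<&str> = files
        .iter()
        .copied()
        .filter(|file| !already_uploaded.contains(file))
        .collect();
    if to_upload.is_empty() {
        // Not an error: there is simply no work left for this checkpoint.
        return None;
    }
    if fail {
        return Some(Err("storage unavailable".to_string()));
    }
    Some(Ok(to_upload.len()))
}

/// `true` means the task is finished, `false` means it should be requeued.
pub fn upload_done(files: &[&str], already_uploaded: &[&str], fail: bool) -> bool {
    match try_upload(files, already_uploaded, fail) {
        Some(Ok(_)) => true,
        Some(Err(_)) => false,
        // Nothing to upload counts as done, so the task is not retried forever.
        None => true,
    }
}
```

With the earlier `ensure!` an empty file list produced an error and the task was requeued; treating it as `None` lets the caller finish the task instead.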

File diff suppressed because it is too large

View File

@@ -3,19 +3,23 @@
use crate::config::PageServerConf;
use crate::layered_repository::LayeredRepository;
use crate::repository::{Repository, Timeline, TimelineSyncState};
use crate::remote_storage::RemoteIndex;
use crate::repository::{Repository, TimelineSyncStatusUpdate};
use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
use crate::timelines;
use crate::timelines::CreateRepo;
use crate::walredo::PostgresRedoManager;
use crate::CheckpointConfig;
use crate::{DatadirTimelineImpl, RepositoryImpl};
use anyhow::{Context, Result};
use lazy_static::lazy_static;
use log::*;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::fmt;
use std::sync::{Arc, Mutex, MutexGuard};
use tracing::*;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
lazy_static! {
@@ -24,7 +28,9 @@ lazy_static! {
struct Tenant {
state: TenantState,
repo: Arc<dyn Repository>,
repo: Arc<RepositoryImpl>,
timelines: HashMap<ZTimelineId, Arc<DatadirTimelineImpl>>,
}
#[derive(Debug, Serialize, Deserialize, Clone, Copy, PartialEq, Eq)]
@@ -57,79 +63,68 @@ fn access_tenants() -> MutexGuard<'static, HashMap<ZTenantId, Tenant>> {
TENANTS.lock().unwrap()
}
/// Updates tenants' repositories, changing their timelines state in memory.
pub fn set_timeline_states(
// Sets up the WAL redo manager and repository for a tenant. Reduces code duplication.
// Used during pageserver startup, or when a new tenant is attached to the pageserver.
pub fn load_local_repo(
conf: &'static PageServerConf,
timeline_states: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncState>>,
) {
if timeline_states.is_empty() {
debug!("no timeline state updates to perform");
return;
}
info!("Updating states for {} timelines", timeline_states.len());
trace!("States: {:?}", timeline_states);
tenant_id: ZTenantId,
remote_index: &RemoteIndex,
) -> Arc<RepositoryImpl> {
let mut m = access_tenants();
for (tenant_id, timeline_states) in timeline_states {
let tenant = m.entry(tenant_id).or_insert_with(|| {
// TODO (rodionov) reuse one of the initialisation routines
// Set up a WAL redo manager, for applying WAL records.
let walredo_mgr = PostgresRedoManager::new(conf, tenant_id);
let tenant = m.entry(tenant_id).or_insert_with(|| {
// Set up a WAL redo manager, for applying WAL records.
let walredo_mgr = PostgresRedoManager::new(conf, tenant_id);
// Set up an object repository, for actual data storage.
let repo: Arc<dyn Repository> = Arc::new(LayeredRepository::new(
conf,
Arc::new(walredo_mgr),
tenant_id,
conf.remote_storage_config.is_some(),
));
Tenant {
state: TenantState::Idle,
repo,
}
});
if let Err(e) = put_timelines_into_tenant(tenant, tenant_id, timeline_states) {
error!(
"Failed to update timeline states for tenant {}: {:?}",
tenant_id, e
);
// Set up an object repository, for actual data storage.
let repo: Arc<LayeredRepository> = Arc::new(LayeredRepository::new(
conf,
Arc::new(walredo_mgr),
tenant_id,
remote_index.clone(),
conf.remote_storage_config.is_some(),
));
Tenant {
state: TenantState::Idle,
repo,
timelines: HashMap::new(),
}
}
});
Arc::clone(&tenant.repo)
}
fn put_timelines_into_tenant(
tenant: &mut Tenant,
tenant_id: ZTenantId,
timeline_states: HashMap<ZTimelineId, TimelineSyncState>,
) -> anyhow::Result<()> {
for (timeline_id, timeline_state) in timeline_states {
// If the timeline is being put into any other state than Ready,
// stop any threads operating on it.
//
// FIXME: This is racy. A page service thread could just get
// handle on the Timeline, before we call set_timeline_state()
if !matches!(timeline_state, TimelineSyncState::Ready(_)) {
thread_mgr::shutdown_threads(None, Some(tenant_id), Some(timeline_id));
// Should we run a final checkpoint to flush all the data to
// disk? Doesn't seem necessary; all of the states other than
// Ready imply that the data on local disk is corrupt or incomplete,
// and we don't want to flush that to disk.
}
tenant
.repo
.set_timeline_state(timeline_id, timeline_state)
.with_context(|| {
format!(
"Failed to update timeline {} state to {:?}",
timeline_id, timeline_state
)
})?;
/// Updates tenants' repositories, changing their timelines state in memory.
pub fn apply_timeline_sync_status_updates(
conf: &'static PageServerConf,
remote_index: RemoteIndex,
sync_status_updates: HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncStatusUpdate>>,
) {
if sync_status_updates.is_empty() {
debug!("no sync status updates to apply");
return;
}
info!(
"Applying sync status updates for {} timelines",
sync_status_updates.len()
);
trace!("Sync status updates: {:?}", sync_status_updates);
Ok(())
for (tenant_id, tenant_timelines_sync_status_updates) in sync_status_updates {
let repo = load_local_repo(conf, tenant_id, &remote_index);
for (timeline_id, timeline_sync_status_update) in tenant_timelines_sync_status_updates {
match repo.apply_timeline_remote_sync_status_update(timeline_id, timeline_sync_status_update)
{
Ok(_) => debug!(
"successfully applied timeline sync status update: {} -> {}",
timeline_id, timeline_sync_status_update
),
Err(e) => error!(
"Failed to apply timeline sync status update for tenant {}. timeline {} update {} Error: {:#}",
tenant_id, timeline_id, timeline_sync_status_update, e
),
}
}
}
}
///
@@ -146,7 +141,7 @@ pub fn shutdown_all_tenants() {
thread_mgr::shutdown_threads(Some(ThreadKind::WalReceiver), None, None);
thread_mgr::shutdown_threads(Some(ThreadKind::GarbageCollector), None, None);
thread_mgr::shutdown_threads(Some(ThreadKind::Checkpointer), None, None);
thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), None, None);
// Ok, no background threads running anymore. Flush any remaining data in
// memory to disk.
@@ -160,7 +155,7 @@ pub fn shutdown_all_tenants() {
debug!("shutdown tenant {}", tenantid);
match get_repository_for_tenant(tenantid) {
Ok(repo) => {
if let Err(err) = repo.checkpoint_iteration(CheckpointConfig::Flush) {
if let Err(err) = repo.checkpoint() {
error!(
"Could not checkpoint tenant {} during shutdown: {:?}",
tenantid, err
@@ -179,24 +174,31 @@ pub fn shutdown_all_tenants() {
pub fn create_tenant_repository(
conf: &'static PageServerConf,
new_tenant_id: Option<ZTenantId>,
tenantid: ZTenantId,
remote_index: RemoteIndex,
) -> Result<Option<ZTenantId>> {
let new_tenant_id = new_tenant_id.unwrap_or_else(ZTenantId::generate);
let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, new_tenant_id));
match timelines::create_repo(conf, new_tenant_id, wal_redo_manager)? {
Some(repo) => {
access_tenants()
.entry(new_tenant_id)
.or_insert_with(|| Tenant {
state: TenantState::Idle,
repo,
});
Ok(Some(new_tenant_id))
}
None => {
debug!("repository already exists for tenant {}", new_tenant_id);
match access_tenants().entry(tenantid) {
Entry::Occupied(_) => {
debug!("tenant {} already exists", tenantid);
Ok(None)
}
Entry::Vacant(v) => {
let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenantid));
let repo = timelines::create_repo(
conf,
tenantid,
CreateRepo::Real {
wal_redo_manager,
remote_index,
},
)?;
v.insert(Tenant {
state: TenantState::Idle,
repo,
timelines: HashMap::new(),
});
Ok(Some(tenantid))
}
}
}
@@ -205,41 +207,50 @@ pub fn get_tenant_state(tenantid: ZTenantId) -> Option<TenantState> {
}
///
/// Change the state of a tenant to Active and launch its checkpointer and GC
/// Change the state of a tenant to Active and launch its compactor and GC
/// threads. If the tenant was already in Active state or Stopping, does nothing.
///
pub fn activate_tenant(conf: &'static PageServerConf, tenantid: ZTenantId) -> Result<()> {
pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> Result<()> {
let mut m = access_tenants();
let tenant = m
.get_mut(&tenantid)
.with_context(|| format!("Tenant not found for id {}", tenantid))?;
.get_mut(&tenant_id)
.with_context(|| format!("Tenant not found for id {}", tenant_id))?;
info!("activating tenant {}", tenantid);
info!("activating tenant {}", tenant_id);
match tenant.state {
// If the tenant is already active, nothing to do.
TenantState::Active => {}
// If it's Idle, launch the checkpointer and GC threads
// If it's Idle, launch the compactor and GC threads
TenantState::Idle => {
thread_mgr::spawn(
ThreadKind::Checkpointer,
Some(tenantid),
ThreadKind::Compactor,
Some(tenant_id),
None,
"Checkpointer thread",
move || crate::tenant_threads::checkpoint_loop(tenantid, conf),
"Compactor thread",
true,
move || crate::tenant_threads::compact_loop(tenant_id, conf),
)?;
// FIXME: if we fail to launch the GC thread, but already launched the
// checkpointer, we're in a strange state.
thread_mgr::spawn(
let gc_spawn_result = thread_mgr::spawn(
ThreadKind::GarbageCollector,
Some(tenantid),
Some(tenant_id),
None,
"GC thread",
move || crate::tenant_threads::gc_loop(tenantid, conf),
)?;
true,
move || crate::tenant_threads::gc_loop(tenant_id, conf),
)
.with_context(|| format!("Failed to launch GC thread for tenant {}", tenant_id));
if let Err(e) = &gc_spawn_result {
error!(
"Failed to start GC thread for tenant {}, stopping its compactor thread: {:?}",
tenant_id, e
);
thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None);
return gc_spawn_result;
}
tenant.state = TenantState::Active;
}
@@ -251,28 +262,46 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenantid: ZTenantId) -> Re
Ok(())
}
pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result<Arc<dyn Repository>> {
pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result<Arc<RepositoryImpl>> {
let m = access_tenants();
let tenant = m
.get(&tenantid)
.with_context(|| format!("Tenant not found for tenant {}", tenantid))?;
.with_context(|| format!("Tenant {} not found", tenantid))?;
Ok(Arc::clone(&tenant.repo))
}
pub fn get_timeline_for_tenant(
// Retrieve timeline for tenant. Load it into memory if it is not already loaded
pub fn get_timeline_for_tenant_load(
tenantid: ZTenantId,
timelineid: ZTimelineId,
) -> Result<Arc<dyn Timeline>> {
get_repository_for_tenant(tenantid)?
.get_timeline(timelineid)?
.local_timeline()
.with_context(|| format!("cannot fetch timeline {}", timelineid))
) -> Result<Arc<DatadirTimelineImpl>> {
let mut m = access_tenants();
let tenant = m
.get_mut(&tenantid)
.with_context(|| format!("Tenant {} not found", tenantid))?;
if let Some(page_tline) = tenant.timelines.get(&timelineid) {
return Ok(Arc::clone(page_tline));
}
// First access to this timeline. Create a DatadirTimeline wrapper for it
let tline = tenant
.repo
.get_timeline_load(timelineid)
.with_context(|| format!("Timeline {} not found for tenant {}", timelineid, tenantid))?;
let repartition_distance = tenant.repo.conf.checkpoint_distance / 10;
let page_tline = Arc::new(DatadirTimelineImpl::new(tline, repartition_distance));
page_tline.init_logical_size()?;
tenant.timelines.insert(timelineid, Arc::clone(&page_tline));
Ok(page_tline)
}
#[serde_as]
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantInfo {
#[serde(with = "hex")]
#[serde_as(as = "DisplayFromStr")]
pub id: ZTenantId,
pub state: TenantState,
}
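The reworked `create_tenant_repository` checks the tenant map first and builds the repository only for a vacant entry. A minimal std-only sketch of that pattern is below; `Registry`, the `u32` id, and the `String` "repository" are hypothetical stand-ins for the tenant map, `ZTenantId`, and `RepositoryImpl`.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical registry: tenant id -> repository (a `String` here).
pub struct Registry {
    tenants: HashMap<u32, String>,
}

impl Registry {
    pub fn new() -> Self {
        Self {
            tenants: HashMap::new(),
        }
    }

    /// Returns `Ok(Some(id))` when a tenant was created, `Ok(None)` when it
    /// already existed. `build` runs only for a vacant entry, and a failed
    /// build leaves the map unchanged.
    pub fn create(
        &mut self,
        id: u32,
        build: impl FnOnce() -> Result<String, String>,
    ) -> Result<Option<u32>, String> {
        match self.tenants.entry(id) {
            Entry::Occupied(_) => Ok(None),
            Entry::Vacant(slot) => {
                let repo = build()?;
                slot.insert(repo);
                Ok(Some(id))
            }
        }
    }

    pub fn contains(&self, id: u32) -> bool {
        self.tenants.contains_key(&id)
    }

    pub fn get(&self, id: u32) -> Option<&str> {
        self.tenants.get(&id).map(|repo| repo.as_str())
    }
}
```

Matching on the entry before doing any work avoids creating on-disk state for a tenant that is already registered, which the previous create-then-insert order could not guarantee.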

View File

@@ -1,34 +1,42 @@
//! This module contains functions to serve per-tenant background processes,
//! such as checkpointer and GC
//! such as compaction and GC
use crate::config::PageServerConf;
use crate::repository::Repository;
use crate::tenant_mgr;
use crate::tenant_mgr::TenantState;
use crate::CheckpointConfig;
use anyhow::Result;
use std::time::Duration;
use tracing::*;
use zenith_utils::zid::ZTenantId;
///
/// Checkpointer thread's main loop
/// Compaction thread's main loop
///
pub fn checkpoint_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> {
pub fn compact_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> {
if let Err(err) = compact_loop_ext(tenantid, conf) {
error!("compact loop terminated with error: {:?}", err);
Err(err)
} else {
Ok(())
}
}
fn compact_loop_ext(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> {
loop {
if tenant_mgr::get_tenant_state(tenantid) != Some(TenantState::Active) {
break;
}
std::thread::sleep(conf.checkpoint_period);
trace!("checkpointer thread for tenant {} waking up", tenantid);
std::thread::sleep(conf.compaction_period);
trace!("compaction thread for tenant {} waking up", tenantid);
// checkpoint timelines that have accumulated more than CHECKPOINT_DISTANCE
// bytes of WAL since last checkpoint.
// Compact timelines
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
repo.checkpoint_iteration(CheckpointConfig::Distance(conf.checkpoint_distance))?;
repo.compaction_iteration()?;
}
trace!(
"checkpointer thread stopped for tenant {} state is {:?}",
"compaction thread stopped for tenant {} state is {:?}",
tenantid,
tenant_mgr::get_tenant_state(tenantid)
);
@@ -49,7 +57,7 @@ pub fn gc_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()>
// Garbage collect old files that are not needed for PITR anymore
if conf.gc_horizon > 0 {
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
repo.gc_iteration(None, conf.gc_horizon, false).unwrap();
repo.gc_iteration(None, conf.gc_horizon, false)?;
}
// TODO Write it in more adequate way using

View File

@@ -43,12 +43,14 @@ use std::thread::JoinHandle;
use tokio::sync::watch;
use tracing::{info, warn};
use tracing::{debug, error, info, warn};
use lazy_static::lazy_static;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::shutdown_pageserver;
lazy_static! {
/// Each thread that we track is associated with a "thread ID". It's just
/// an increasing number that we assign, not related to any system thread
@@ -92,13 +94,16 @@ pub enum ThreadKind {
// Thread that connects to a safekeeper to fetch WAL for one timeline.
WalReceiver,
// Thread that handles checkpointing of all timelines for a tenant.
Checkpointer,
// Thread that handles compaction of all timelines for a tenant.
Compactor,
// Thread that handles GC of a tenant
GarbageCollector,
// Thread for synchronizing pageserver relish data with the remote storage.
// Thread that flushes frozen in-memory layers to disk
LayerFlushThread,
// Thread for synchronizing pageserver layer files with the remote storage.
// Shared by all tenants.
StorageSync,
}
@@ -125,15 +130,16 @@ struct PageServerThread {
}
/// Launch a new thread
pub fn spawn<F, E>(
pub fn spawn<F>(
kind: ThreadKind,
tenant_id: Option<ZTenantId>,
timeline_id: Option<ZTimelineId>,
name: &str,
fail_on_error: bool,
f: F,
) -> std::io::Result<()>
where
F: FnOnce() -> Result<(), E> + Send + 'static,
F: FnOnce() -> anyhow::Result<()> + Send + 'static,
{
let (shutdown_tx, shutdown_rx) = watch::channel(());
let thread_id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
@@ -160,12 +166,22 @@ where
.insert(thread_id, Arc::clone(&thread_rc));
let thread_rc2 = Arc::clone(&thread_rc);
let thread_name = name.to_string();
let join_handle = match thread::Builder::new()
.name(name.to_string())
.spawn(move || thread_wrapper(thread_id, thread_rc2, shutdown_rx, f))
{
.spawn(move || {
thread_wrapper(
thread_name,
thread_id,
thread_rc2,
shutdown_rx,
fail_on_error,
f,
)
}) {
Ok(handle) => handle,
Err(err) => {
error!("Failed to spawn thread '{}': {}", name, err);
// Could not spawn the thread. Remove the entry
THREADS.lock().unwrap().remove(&thread_id);
return Err(err);
@@ -180,13 +196,15 @@ where
/// This wrapper function runs in a newly-spawned thread. It initializes the
/// thread-local variables and calls the payload function
fn thread_wrapper<F, E>(
fn thread_wrapper<F>(
thread_name: String,
thread_id: u64,
thread: Arc<PageServerThread>,
shutdown_rx: watch::Receiver<()>,
fail_on_error: bool,
f: F,
) where
F: FnOnce() -> Result<(), E> + Send + 'static,
F: FnOnce() -> anyhow::Result<()> + Send + 'static,
{
SHUTDOWN_RX.with(|rx| {
*rx.borrow_mut() = Some(shutdown_rx);
@@ -195,6 +213,8 @@ fn thread_wrapper<F, E>(
*ct.borrow_mut() = Some(thread);
});
debug!("Starting thread '{}'", thread_name);
// We use AssertUnwindSafe here so that the payload function
// doesn't need to be UnwindSafe. We don't do anything after the
// unwinding that would expose us to unwind-unsafe behavior.
@@ -203,9 +223,26 @@ fn thread_wrapper<F, E>(
// Remove our entry from the global hashmap.
THREADS.lock().unwrap().remove(&thread_id);
// If the thread payload panic'd, exit with the panic.
if let Err(err) = result {
panic::resume_unwind(err);
match result {
Ok(Ok(())) => debug!("Thread '{}' exited normally", thread_name),
Ok(Err(err)) => {
if fail_on_error {
error!(
"Shutting down: thread '{}' exited with error: {:?}",
thread_name, err
);
shutdown_pageserver();
} else {
error!("Thread '{}' exited with error: {:?}", thread_name, err);
}
}
Err(err) => {
error!(
"Shutting down: thread '{}' panicked: {:?}",
thread_name, err
);
shutdown_pageserver();
}
}
}
@@ -250,7 +287,7 @@ pub fn shutdown_threads(
let _ = join_handle.join();
} else {
// The thread had not even fully started yet. Or it was shut down
// concurrently and alrady exited
// concurrently and already exited
}
}
}
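The new `thread_wrapper` distinguishes a normal exit, an error return, and a panic, and requests a pageserver shutdown for the last two (for errors only when `fail_on_error` is set). A minimal std-only sketch of that decision logic follows; `run_wrapped`, `ThreadExit`, and the `AtomicBool` flag are hypothetical, with the flag standing in for the call to `shutdown_pageserver()` and `String` standing in for `anyhow::Error`.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// How the payload finished (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum ThreadExit {
    Normal,
    Error,
    Panic,
}

/// Runs `f`, catching panics, and sets `shutdown_requested` in the cases
/// where the diff's wrapper would shut the pageserver down.
pub fn run_wrapped<F>(fail_on_error: bool, shutdown_requested: &AtomicBool, f: F) -> ThreadExit
where
    F: FnOnce() -> Result<(), String>,
{
    // AssertUnwindSafe lets the payload skip the UnwindSafe bound; nothing
    // touched by `f` is inspected here after an unwind.
    let result = panic::catch_unwind(AssertUnwindSafe(f));
    match result {
        Ok(Ok(())) => ThreadExit::Normal,
        Ok(Err(_)) => {
            if fail_on_error {
                shutdown_requested.store(true, Ordering::SeqCst);
            }
            ThreadExit::Error
        }
        Err(_) => {
            // A panic always requests shutdown, regardless of `fail_on_error`.
            shutdown_requested.store(true, Ordering::SeqCst);
            ThreadExit::Panic
        }
    }
}
```

Passing `fail_on_error` per thread lets critical loops such as the compactor and GC take the whole process down on failure, while less critical threads only log the error.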

View File

@@ -2,8 +2,10 @@
//! Timeline management code
//
use anyhow::{anyhow, bail, Context, Result};
use anyhow::{bail, Context, Result};
use postgres_ffi::ControlFileData;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::{
fs,
path::Path,
@@ -16,131 +18,116 @@ use zenith_utils::lsn::Lsn;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use zenith_utils::{crashsafe_dir, logging};
use crate::{config::PageServerConf, repository::Repository};
use crate::{
config::PageServerConf,
layered_repository::metadata::TimelineMetadata,
remote_storage::RemoteIndex,
repository::{LocalTimelineState, Repository},
DatadirTimeline, RepositoryImpl,
};
use crate::{import_datadir, LOG_FILE_NAME};
use crate::{layered_repository::LayeredRepository, walredo::WalRedoManager};
use crate::{repository::RepositoryTimeline, tenant_mgr};
use crate::{repository::Timeline, CheckpointConfig};
#[derive(Clone)]
pub enum TimelineInfo {
Local {
timeline_id: ZTimelineId,
tenant_id: ZTenantId,
last_record_lsn: Lsn,
prev_record_lsn: Lsn,
ancestor_timeline_id: Option<ZTimelineId>,
ancestor_lsn: Option<Lsn>,
disk_consistent_lsn: Lsn,
current_logical_size: usize,
current_logical_size_non_incremental: Option<usize>,
},
Remote {
timeline_id: ZTimelineId,
tenant_id: ZTenantId,
disk_consistent_lsn: Lsn,
},
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct LocalTimelineInfo {
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<ZTimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub last_record_lsn: Lsn,
#[serde_as(as = "Option<DisplayFromStr>")]
pub prev_record_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
pub current_logical_size: Option<usize>, // is None when timeline is Unloaded
pub current_logical_size_non_incremental: Option<usize>,
pub timeline_state: LocalTimelineState,
}
impl TimelineInfo {
pub fn from_repo_timeline(
tenant_id: ZTenantId,
repo_timeline: RepositoryTimeline,
impl LocalTimelineInfo {
pub fn from_loaded_timeline<R: Repository>(
datadir_tline: &DatadirTimeline<R>,
include_non_incremental_logical_size: bool,
) -> Self {
match repo_timeline {
RepositoryTimeline::Local { id, timeline } => {
let ancestor_timeline_id = timeline.get_ancestor_timeline_id();
let ancestor_lsn = if ancestor_timeline_id.is_some() {
Some(timeline.get_ancestor_lsn())
} else {
None
};
Self::Local {
timeline_id: id,
tenant_id,
last_record_lsn: timeline.get_last_record_lsn(),
prev_record_lsn: timeline.get_prev_record_lsn(),
ancestor_timeline_id,
ancestor_lsn,
disk_consistent_lsn: timeline.get_disk_consistent_lsn(),
current_logical_size: timeline.get_current_logical_size(),
current_logical_size_non_incremental: get_current_logical_size_non_incremental(
include_non_incremental_logical_size,
timeline.as_ref(),
),
) -> anyhow::Result<Self> {
let last_record_lsn = datadir_tline.tline.get_last_record_lsn();
let info = LocalTimelineInfo {
ancestor_timeline_id: datadir_tline.tline.get_ancestor_timeline_id(),
ancestor_lsn: {
match datadir_tline.tline.get_ancestor_lsn() {
Lsn(0) => None,
lsn @ Lsn(_) => Some(lsn),
}
}
RepositoryTimeline::Remote {
id,
disk_consistent_lsn,
} => Self::Remote {
timeline_id: id,
tenant_id,
disk_consistent_lsn,
},
disk_consistent_lsn: datadir_tline.tline.get_disk_consistent_lsn(),
last_record_lsn,
prev_record_lsn: Some(datadir_tline.tline.get_prev_record_lsn()),
timeline_state: LocalTimelineState::Loaded,
current_logical_size: Some(datadir_tline.get_current_logical_size()),
current_logical_size_non_incremental: if include_non_incremental_logical_size {
Some(datadir_tline.get_current_logical_size_non_incremental(last_record_lsn)?)
} else {
None
},
};
Ok(info)
}
pub fn from_unloaded_timeline(metadata: &TimelineMetadata) -> Self {
LocalTimelineInfo {
ancestor_timeline_id: metadata.ancestor_timeline(),
ancestor_lsn: {
match metadata.ancestor_lsn() {
Lsn(0) => None,
lsn @ Lsn(_) => Some(lsn),
}
},
disk_consistent_lsn: metadata.disk_consistent_lsn(),
last_record_lsn: metadata.disk_consistent_lsn(),
prev_record_lsn: metadata.prev_record_lsn(),
timeline_state: LocalTimelineState::Unloaded,
current_logical_size: None,
current_logical_size_non_incremental: None,
}
}
pub fn from_dyn_timeline(
pub fn from_repo_timeline<T>(
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
timeline: &dyn Timeline,
repo_timeline: &RepositoryTimeline<T>,
include_non_incremental_logical_size: bool,
) -> Self {
let ancestor_timeline_id = timeline.get_ancestor_timeline_id();
let ancestor_lsn = if ancestor_timeline_id.is_some() {
Some(timeline.get_ancestor_lsn())
} else {
None
};
Self::Local {
timeline_id,
tenant_id,
last_record_lsn: timeline.get_last_record_lsn(),
prev_record_lsn: timeline.get_prev_record_lsn(),
ancestor_timeline_id,
ancestor_lsn,
disk_consistent_lsn: timeline.get_disk_consistent_lsn(),
current_logical_size: timeline.get_current_logical_size(),
current_logical_size_non_incremental: get_current_logical_size_non_incremental(
include_non_incremental_logical_size,
timeline,
),
}
}
pub fn timeline_id(&self) -> ZTimelineId {
match *self {
TimelineInfo::Local { timeline_id, .. } => timeline_id,
TimelineInfo::Remote { timeline_id, .. } => timeline_id,
}
}
pub fn tenant_id(&self) -> ZTenantId {
match *self {
TimelineInfo::Local { tenant_id, .. } => tenant_id,
TimelineInfo::Remote { tenant_id, .. } => tenant_id,
) -> anyhow::Result<Self> {
match repo_timeline {
RepositoryTimeline::Loaded(_) => {
let datadir_tline =
tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id)?;
Self::from_loaded_timeline(&datadir_tline, include_non_incremental_logical_size)
}
RepositoryTimeline::Unloaded { metadata } => Ok(Self::from_unloaded_timeline(metadata)),
}
}
}
fn get_current_logical_size_non_incremental(
include_non_incremental_logical_size: bool,
timeline: &dyn Timeline,
) -> Option<usize> {
if !include_non_incremental_logical_size {
return None;
}
match timeline.get_current_logical_size_non_incremental(timeline.get_last_record_lsn()) {
Ok(size) => Some(size),
Err(e) => {
error!("Failed to get non-incremental logical size: {:?}", e);
None
}
}
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct RemoteTimelineInfo {
#[serde_as(as = "Option<DisplayFromStr>")]
pub remote_consistent_lsn: Option<Lsn>,
pub awaits_download: bool,
}
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TimelineInfo {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: ZTenantId,
#[serde_as(as = "DisplayFromStr")]
pub timeline_id: ZTimelineId,
pub local: Option<LocalTimelineInfo>,
pub remote: Option<RemoteTimelineInfo>,
}
#[derive(Debug, Clone, Copy)]
@@ -158,25 +145,12 @@ pub fn init_pageserver(
// use true as daemonize parameter because otherwise we pollute zenith cli output with a few pages long output of info messages
let _log_file = logging::init(LOG_FILE_NAME, true)?;
// We don't use the real WAL redo manager, because we don't want to spawn the WAL redo
// process during repository initialization.
//
// FIXME: That caused trouble, because the WAL redo manager spawned a thread that launched
// initdb in the background, and it kept running even after the "zenith init" had exited.
// In tests, we started the page server immediately after that, so that initdb was still
// running in the background, and we failed to run initdb again in the same directory. This
// has been solved for the rapid init+start case now, but the general race condition remains
// if you restart the server quickly. The WAL redo manager doesn't use a separate thread
// anymore, but I think that could still happen.
let dummy_redo_mgr = Arc::new(crate::walredo::DummyRedoManager {});
crashsafe_dir::create_dir_all(conf.tenants_path())?;
if let Some(tenant_id) = create_tenant {
println!("initializing tenantid {}", tenant_id);
let repo = create_repo(conf, tenant_id, dummy_redo_mgr)
.context("failed to create repo")?
.ok_or_else(|| anyhow!("For newely created pageserver, found already existing repository for tenant {}", tenant_id))?;
let repo =
create_repo(conf, tenant_id, CreateRepo::Dummy).context("failed to create repo")?;
let new_timeline_id = initial_timeline_id.unwrap_or_else(ZTimelineId::generate);
bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())
.context("failed to create initial timeline")?;
@@ -189,15 +163,44 @@ pub fn init_pageserver(
Ok(())
}
pub enum CreateRepo {
Real {
wal_redo_manager: Arc<dyn WalRedoManager + Send + Sync>,
remote_index: RemoteIndex,
},
Dummy,
}
pub fn create_repo(
conf: &'static PageServerConf,
tenant_id: ZTenantId,
wal_redo_manager: Arc<dyn WalRedoManager + Send + Sync>,
) -> Result<Option<Arc<dyn Repository>>> {
create_repo: CreateRepo,
) -> Result<Arc<RepositoryImpl>> {
let (wal_redo_manager, remote_index) = match create_repo {
CreateRepo::Real {
wal_redo_manager,
remote_index,
} => (wal_redo_manager, remote_index),
CreateRepo::Dummy => {
// We don't use the real WAL redo manager, because we don't want to spawn the WAL redo
// process during repository initialization.
//
// FIXME: That caused trouble, because the WAL redo manager spawned a thread that launched
// initdb in the background, and it kept running even after the "zenith init" had exited.
// In tests, we started the page server immediately after that, so that initdb was still
// running in the background, and we failed to run initdb again in the same directory. This
// has been solved for the rapid init+start case now, but the general race condition remains
// if you restart the server quickly. The WAL redo manager doesn't use a separate thread
// anymore, but I think that could still happen.
let wal_redo_manager = Arc::new(crate::walredo::DummyRedoManager {});
(wal_redo_manager as _, RemoteIndex::empty())
}
};
let repo_dir = conf.tenant_path(&tenant_id);
if repo_dir.exists() {
debug!("repo for {} already exists", tenant_id);
return Ok(None);
bail!("tenant {} directory already exists", tenant_id);
}
// top-level dir may exist if we are creating it through CLI
@@ -206,12 +209,13 @@ pub fn create_repo(
crashsafe_dir::create_dir(conf.timelines_path(&tenant_id))?;
info!("created directory structure in {}", repo_dir.display());
Ok(Some(Arc::new(LayeredRepository::new(
Ok(Arc::new(LayeredRepository::new(
conf,
wal_redo_manager,
tenant_id,
remote_index,
conf.remote_storage_config.is_some(),
))))
)))
}
// Returns checkpoint LSN from controlfile
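
Note on the `CreateRepo` hunks above: the enum replaces a bare `wal_redo_manager` argument, so the dummy manager and an empty remote index are chosen in one place inside `create_repo`. A small sketch of the same shape, with hypothetical stand-ins (`RedoManager`, `Index`, `resolve`) instead of the real `WalRedoManager` and `RemoteIndex` types:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the `WalRedoManager` trait.
trait RedoManager: Send + Sync {
    fn name(&self) -> &'static str;
}

struct DummyManager;
impl RedoManager for DummyManager {
    fn name(&self) -> &'static str {
        "dummy"
    }
}

struct RealManager;
impl RedoManager for RealManager {
    fn name(&self) -> &'static str {
        "real"
    }
}

/// Hypothetical stand-in for `RemoteIndex`; the default value is empty.
#[derive(Debug, Default, PartialEq)]
struct Index {
    entries: Vec<String>,
}

/// Mirrors the two variants of `CreateRepo` in the diff.
enum CreateRepo {
    Real {
        redo: Arc<dyn RedoManager>,
        index: Index,
    },
    Dummy,
}

/// Resolves the creation mode into the pair the repository constructor needs.
fn resolve(mode: CreateRepo) -> (Arc<dyn RedoManager>, Index) {
    match mode {
        CreateRepo::Real { redo, index } => (redo, index),
        CreateRepo::Dummy => {
            // The typed binding performs the unsizing to a trait object.
            let redo: Arc<dyn RedoManager> = Arc::new(DummyManager);
            (redo, Index::default())
        }
    }
}
```

The benefit of this layout, as far as the diff shows, is that callers such as `init_pageserver` no longer construct the dummy manager themselves.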
@@ -232,7 +236,7 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
let initdb_path = conf.pg_bin_dir().join("initdb");
let initdb_output = Command::new(initdb_path)
.args(&["-D", initdbpath.to_str().unwrap()])
.args(&["-D", &initdbpath.to_string_lossy()])
.args(&["-U", &conf.superuser])
.args(&["-E", "utf8"])
.arg("--no-instructions")
@@ -240,8 +244,8 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
// so no need to fsync it
.arg("--no-sync")
.env_clear()
.env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("LD_LIBRARY_PATH", conf.pg_lib_dir())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir())
.stdout(Stdio::null())
.output()
.context("failed to execute initdb")?;
@@ -259,12 +263,12 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
// - run initdb to init temporary instance and get bootstrap data
// - after initialization complete, remove the temp dir.
//
fn bootstrap_timeline(
fn bootstrap_timeline<R: Repository>(
conf: &'static PageServerConf,
tenantid: ZTenantId,
tli: ZTimelineId,
repo: &dyn Repository,
) -> Result<Arc<dyn Timeline>> {
repo: &R,
) -> Result<()> {
let _enter = info_span!("bootstrapping", timeline = %tli, tenant = %tenantid).entered();
let initdb_path = conf.tenant_path(&tenantid).join("tmp");
@@ -280,49 +284,43 @@ fn bootstrap_timeline(
// Initdb lsn will be equal to last_record_lsn which will be set after import.
// Because we know it upfront avoid having an option or dummy zero value by passing it to create_empty_timeline.
let timeline = repo.create_empty_timeline(tli, lsn)?;
import_datadir::import_timeline_from_postgres_datadir(
&pgdata_path,
timeline.writer().as_ref(),
lsn,
)?;
timeline.checkpoint(CheckpointConfig::Forced)?;
let mut page_tline: DatadirTimeline<R> = DatadirTimeline::new(timeline, u64::MAX);
import_datadir::import_timeline_from_postgres_datadir(&pgdata_path, &mut page_tline, lsn)?;
page_tline.tline.checkpoint(CheckpointConfig::Forced)?;
println!(
"created initial timeline {} timeline.lsn {}",
tli,
timeline.get_last_record_lsn()
page_tline.tline.get_last_record_lsn()
);
// Remove temp dir. We don't need it anymore
fs::remove_dir_all(pgdata_path)?;
Ok(timeline)
Ok(())
}
pub(crate) fn get_timelines(
pub(crate) fn get_local_timelines(
tenant_id: ZTenantId,
include_non_incremental_logical_size: bool,
) -> Result<Vec<TimelineInfo>> {
) -> Result<Vec<(ZTimelineId, LocalTimelineInfo)>> {
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)
.with_context(|| format!("Failed to get repo for tenant {}", tenant_id))?;
let repo_timelines = repo.list_timelines();
Ok(repo
.list_timelines()
.with_context(|| format!("Failed to list timelines for tenant {}", tenant_id))?
.into_iter()
.filter_map(|timeline| match timeline {
RepositoryTimeline::Local { timeline, id } => Some((id, timeline)),
RepositoryTimeline::Remote { .. } => None,
})
.map(|(timeline_id, timeline)| {
TimelineInfo::from_dyn_timeline(
let mut local_timeline_info = Vec::with_capacity(repo_timelines.len());
for (timeline_id, repository_timeline) in repo_timelines {
local_timeline_info.push((
timeline_id,
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline.as_ref(),
&repository_timeline,
include_non_incremental_logical_size,
)
})
.collect())
)?,
))
}
Ok(local_timeline_info)
}
pub(crate) fn create_timeline(
@@ -336,16 +334,8 @@ pub(crate) fn create_timeline(
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
if conf.timeline_path(&new_timeline_id, &tenant_id).exists() {
match repo.get_timeline(new_timeline_id)? {
RepositoryTimeline::Local { id, .. } => {
debug!("timeline {} already exists", id);
return Ok(None);
}
RepositoryTimeline::Remote { id, .. } => bail!(
"timeline {} already exists in pageserver's remote storage",
id
),
}
debug!("timeline {} already exists", new_timeline_id);
return Ok(None);
}
let mut start_lsn = ancestor_start_lsn.unwrap_or(Lsn(0));
@@ -353,15 +343,8 @@ pub(crate) fn create_timeline(
let new_timeline_info = match ancestor_timeline_id {
Some(ancestor_timeline_id) => {
let ancestor_timeline = repo
.get_timeline(ancestor_timeline_id)
.with_context(|| format!("Cannot get ancestor timeline {}", ancestor_timeline_id))?
.local_timeline()
.with_context(|| {
format!(
"Cannot branch off the timeline {} that's not present locally",
ancestor_timeline_id
)
})?;
.get_timeline_load(ancestor_timeline_id)
.context("Cannot branch off the timeline that's not present locally")?;
if start_lsn == Lsn(0) {
// Find end of WAL on the old timeline
@@ -391,18 +374,24 @@ pub(crate) fn create_timeline(
}
repo.branch_timeline(ancestor_timeline_id, new_timeline_id, start_lsn)?;
// load the timeline into memory
let loaded_timeline = repo.get_timeline(new_timeline_id)?;
TimelineInfo::from_repo_timeline(tenant_id, loaded_timeline, false)
let loaded_timeline =
tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?;
LocalTimelineInfo::from_loaded_timeline(&loaded_timeline, false)
.context("cannot fill timeline info")?
}
None => {
let new_timeline = bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())?;
TimelineInfo::from_dyn_timeline(
tenant_id,
new_timeline_id,
new_timeline.as_ref(),
false,
)
bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())?;
// load the timeline into memory
let new_timeline =
tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?;
LocalTimelineInfo::from_loaded_timeline(&new_timeline, false)
.context("cannot fill timeline info")?
}
};
Ok(Some(new_timeline_info))
Ok(Some(TimelineInfo {
tenant_id,
timeline_id: new_timeline_id,
local: Some(new_timeline_info),
remote: None,
}))
}
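
Note on `LocalTimelineInfo` above: both constructors now derive `ancestor_lsn` from the LSN value itself, treating `Lsn(0)` as "no ancestor", where the old code checked `ancestor_timeline_id.is_some()`. The binding pattern `lsn @ Lsn(_)` can be tried in isolation; the `Lsn` newtype and the function name below are simplified stand-ins, not the `zenith_utils` definitions:

```rust
/// Simplified stand-in for `zenith_utils::lsn::Lsn`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Lsn(u64);

/// Maps the "zero means absent" convention onto an `Option`,
/// using the same match arms as the diff.
fn ancestor_lsn_or_none(raw: Lsn) -> Option<Lsn> {
    match raw {
        Lsn(0) => None,
        // `lsn @ Lsn(_)` binds the whole value while matching any inner number.
        lsn @ Lsn(_) => Some(lsn),
    }
}
```

Since the second arm is irrefutable, a plain `lsn => Some(lsn)` would behave identically; the `@` form just makes the matched shape explicit.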


@@ -65,6 +65,7 @@ lazy_static! {
/// currently open, the 'handle' can still point to the slot where it was last kept. The
/// 'tag' field is used to detect whether the handle still is valid or not.
///
#[derive(Debug)]
pub struct VirtualFile {
/// Lazy handle to the global file descriptor cache. The slot that this points to
/// might contain our File, or it may be empty, or it may contain a File that
@@ -88,7 +89,7 @@ pub struct VirtualFile {
timelineid: String,
}
#[derive(PartialEq, Clone, Copy)]
#[derive(Debug, PartialEq, Clone, Copy)]
struct SlotHandle {
/// Index into OPEN_FILES.slots
index: usize,
@@ -226,7 +227,8 @@ impl VirtualFile {
path: &Path,
open_options: &OpenOptions,
) -> Result<VirtualFile, std::io::Error> {
let parts = path.to_str().unwrap().split('/').collect::<Vec<&str>>();
let path_str = path.to_string_lossy();
let parts = path_str.split('/').collect::<Vec<&str>>();
let tenantid;
let timelineid;
if parts.len() > 5 && parts[parts.len() - 5] == "tenants" {
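
Note on the hunk above: `path.to_str().unwrap()` panics on a path that is not valid UTF-8, whereas `to_string_lossy()` substitutes U+FFFD and never panics. A sketch of the parsing step under that change; the function name is hypothetical, and both branch bodies are assumptions, since the hunk is cut off before them (the indices assume a layout like `.../tenants/<tenant>/timelines/<timeline>/<file>`):

```rust
use std::path::Path;

/// Extracts (tenant id, timeline id) labels from a layer file path.
/// Falls back to "*" when the path does not have the expected shape
/// (assumed fallback, not shown in the diff).
fn tenant_and_timeline(path: &Path) -> (String, String) {
    // Lossy conversion: invalid UTF-8 becomes U+FFFD instead of a panic.
    let path_str = path.to_string_lossy();
    let parts = path_str.split('/').collect::<Vec<&str>>();
    if parts.len() > 5 && parts[parts.len() - 5] == "tenants" {
        (
            parts[parts.len() - 4].to_string(),
            parts[parts.len() - 2].to_string(),
        )
    } else {
        ("*".to_string(), "*".to_string())
    }
}
```

The intermediate `path_str` binding is needed because `parts` borrows from it; calling `split` directly on the temporary would not compile.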

File diff suppressed because it is too large


@@ -6,6 +6,7 @@
//! We keep one WAL receiver active per timeline.
use crate::config::PageServerConf;
use crate::repository::{Repository, Timeline};
use crate::tenant_mgr;
use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
@@ -31,6 +32,7 @@ use tracing::*;
use zenith_utils::lsn::Lsn;
use zenith_utils::pq_proto::ZenithFeedback;
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTenantTimelineId;
use zenith_utils::zid::ZTimelineId;
//
@@ -68,7 +70,7 @@ pub fn launch_wal_receiver(
match receivers.get_mut(&(tenantid, timelineid)) {
Some(receiver) => {
info!("wal receiver already running, updating connection string");
debug!("wal receiver already running, updating connection string");
receiver.wal_producer_connstr = wal_producer_connstr.into();
}
None => {
@@ -77,9 +79,11 @@ pub fn launch_wal_receiver(
Some(tenantid),
Some(timelineid),
"WAL receiver thread",
false,
move || {
IS_WAL_RECEIVER.with(|c| c.set(true));
thread_main(conf, tenantid, timelineid)
thread_main(conf, tenantid, timelineid);
Ok(())
},
)?;
@@ -109,20 +113,16 @@ fn get_wal_producer_connstr(tenantid: ZTenantId, timelineid: ZTimelineId) -> Str
//
// This is the entry point for the WAL receiver thread.
//
fn thread_main(
conf: &'static PageServerConf,
tenantid: ZTenantId,
timelineid: ZTimelineId,
) -> Result<()> {
let _enter = info_span!("WAL receiver", timeline = %timelineid, tenant = %tenantid).entered();
fn thread_main(conf: &'static PageServerConf, tenant_id: ZTenantId, timeline_id: ZTimelineId) {
let _enter = info_span!("WAL receiver", timeline = %timeline_id, tenant = %tenant_id).entered();
info!("WAL receiver thread started");
// Look up the current WAL producer address
let wal_producer_connstr = get_wal_producer_connstr(tenantid, timelineid);
let wal_producer_connstr = get_wal_producer_connstr(tenant_id, timeline_id);
// Make a connection to the WAL safekeeper, or directly to the primary PostgreSQL server,
// and start streaming WAL from it.
let res = walreceiver_main(conf, tenantid, timelineid, &wal_producer_connstr);
let res = walreceiver_main(conf, tenant_id, timeline_id, &wal_producer_connstr);
// TODO cleanup info messages
if let Err(e) = res {
@@ -130,22 +130,21 @@ fn thread_main(
} else {
info!(
"walreceiver disconnected tenant {}, timelineid {}",
tenantid, timelineid
tenant_id, timeline_id
);
}
// Drop it from list of active WAL_RECEIVERS
// so that the next callmemaybe request launches a new thread
drop_wal_receiver(tenantid, timelineid);
Ok(())
drop_wal_receiver(tenant_id, timeline_id);
}
fn walreceiver_main(
_conf: &PageServerConf,
tenantid: ZTenantId,
timelineid: ZTimelineId,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
wal_producer_connstr: &str,
) -> Result<(), Error> {
) -> anyhow::Result<(), Error> {
// Connect to the database in replication mode.
info!("connecting to {:?}", wal_producer_connstr);
let connect_cfg = format!(
@@ -182,13 +181,16 @@ fn walreceiver_main(
let end_of_wal = Lsn::from(u64::from(identify.xlogpos));
let mut caught_up = false;
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)
.with_context(|| format!("no repository found for tenant {}", tenant_id))?;
let timeline =
tenant_mgr::get_timeline_for_tenant(tenantid, timelineid).with_context(|| {
tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).with_context(|| {
format!(
"Can not start the walrecever for a remote tenant {}, timeline {}",
tenantid, timelineid,
"local timeline {} not found for tenant {}",
timeline_id, tenant_id
)
})?;
let remote_index = repo.get_remote_index();
//
// Start streaming the WAL, from where we left off previously.
@@ -250,11 +252,10 @@ fn walreceiver_main(
// It is important to deal with the aligned records as lsn in getPage@LSN is
// aligned and can be several bytes bigger. Without this alignment we are
// at risk of hittind a deadlock.
assert!(lsn.is_aligned());
// at risk of hitting a deadlock.
anyhow::ensure!(lsn.is_aligned());
let writer = timeline.writer();
walingest.ingest_record(writer.as_ref(), recdata, lsn)?;
walingest.ingest_record(&timeline, recdata, lsn)?;
fail_point!("walreceiver-after-ingest");
@@ -266,6 +267,8 @@ fn walreceiver_main(
caught_up = true;
}
timeline.tline.check_checkpoint_distance()?;
Some(endlsn)
}
@@ -292,19 +295,27 @@ fn walreceiver_main(
};
if let Some(last_lsn) = status_update {
let timeline_synced_disk_consistent_lsn =
tenant_mgr::get_repository_for_tenant(tenantid)?
.get_timeline_state(timelineid)
.and_then(|state| state.remote_disk_consistent_lsn())
.unwrap_or(Lsn(0));
let timeline_remote_consistent_lsn = runtime.block_on(async {
remote_index
.read()
.await
// here we either do not have this timeline in remote index
// or there were no checkpoints for it yet
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.and_then(|e| e.disk_consistent_lsn())
.unwrap_or(Lsn(0)) // no checkpoint was uploaded
});
// The last LSN we processed. It is not guaranteed to survive pageserver crash.
let write_lsn = u64::from(last_lsn);
// `disk_consistent_lsn` is the LSN at which page server guarantees local persistence of all received data
let flush_lsn = u64::from(timeline.get_disk_consistent_lsn());
let flush_lsn = u64::from(timeline.tline.get_disk_consistent_lsn());
// The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash
// Used by safekeepers to remove WAL preceding `remote_consistent_lsn`.
let apply_lsn = u64::from(timeline_synced_disk_consistent_lsn);
let apply_lsn = u64::from(timeline_remote_consistent_lsn);
let ts = SystemTime::now();
// Send zenith feedback message.


@@ -10,7 +10,47 @@ use postgres_ffi::{MultiXactId, MultiXactOffset, MultiXactStatus, Oid, Transacti
use serde::{Deserialize, Serialize};
use tracing::*;
use crate::repository::ZenithWalRecord;
/// Each update to a page is represented by a ZenithWalRecord. It can be a wrapper
/// around a PostgreSQL WAL record, or a custom zenith-specific "record".
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub enum ZenithWalRecord {
/// Native PostgreSQL WAL record
Postgres { will_init: bool, rec: Bytes },
/// Clear bits in heap visibility map. ('flags' is bitmap of bits to clear)
ClearVisibilityMapFlags {
new_heap_blkno: Option<u32>,
old_heap_blkno: Option<u32>,
flags: u8,
},
/// Mark transaction IDs as committed on a CLOG page
ClogSetCommitted { xids: Vec<TransactionId> },
/// Mark transaction IDs as aborted on a CLOG page
ClogSetAborted { xids: Vec<TransactionId> },
/// Extend multixact offsets SLRU
MultixactOffsetCreate {
mid: MultiXactId,
moff: MultiXactOffset,
},
/// Extend multixact members SLRU.
MultixactMembersCreate {
moff: MultiXactOffset,
members: Vec<MultiXactMember>,
},
}
impl ZenithWalRecord {
/// Does replaying this WAL record initialize the page from scratch, or does
/// it need to be applied over the previous image of the page?
pub fn will_init(&self) -> bool {
match self {
ZenithWalRecord::Postgres { will_init, rec: _ } => *will_init,
// None of the special zenith record types currently initialize the page
_ => false,
}
}
}
/// DecodedBkpBlock represents per-page data contained in a WAL record.
#[derive(Default)]
@@ -87,6 +127,28 @@ impl XlRelmapUpdate {
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlSmgrCreate {
pub rnode: RelFileNode,
// FIXME: This is ForkNumber in storage_xlog.h. That's an enum. Does it have
// well-defined size?
pub forknum: u8,
}
impl XlSmgrCreate {
pub fn decode(buf: &mut Bytes) -> XlSmgrCreate {
XlSmgrCreate {
rnode: RelFileNode {
spcnode: buf.get_u32_le(), /* tablespace */
dbnode: buf.get_u32_le(), /* database */
relnode: buf.get_u32_le(), /* relation */
},
forknum: buf.get_u32_le() as u8,
}
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlSmgrTruncate {
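
Note on `XlSmgrCreate::decode` above: the record body is read as four consecutive little-endian `u32` values, and the last one is narrowed to `u8` for `forknum`. A standard-library sketch of the same layout; it swaps `bytes::Buf::get_u32_le` for a hypothetical `take_u32_le` helper that returns `None` on a short buffer, so unlike the diff's version it does not panic on truncated input:

```rust
#[derive(Debug, PartialEq)]
struct RelFileNode {
    spcnode: u32, // tablespace
    dbnode: u32,  // database
    relnode: u32, // relation
}

#[derive(Debug, PartialEq)]
struct XlSmgrCreate {
    rnode: RelFileNode,
    forknum: u8,
}

/// Reads one little-endian u32 and advances the slice (hypothetical helper).
fn take_u32_le(buf: &mut &[u8]) -> Option<u32> {
    let data: &[u8] = *buf;
    if data.len() < 4 {
        return None;
    }
    let (head, rest) = data.split_at(4);
    *buf = rest;
    Some(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
}

/// Decodes the four fields in the order used by the diff.
fn decode_smgr_create(mut buf: &[u8]) -> Option<XlSmgrCreate> {
    let spcnode = take_u32_le(&mut buf)?;
    let dbnode = take_u32_le(&mut buf)?;
    let relnode = take_u32_le(&mut buf)?;
    // Stored as a 4-byte value on the wire; narrowed as in the diff.
    let forknum = take_u32_le(&mut buf)? as u8;
    Some(XlSmgrCreate {
        rnode: RelFileNode {
            spcnode,
            dbnode,
            relnode,
        },
        forknum,
    })
}
```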


@@ -21,7 +21,6 @@
use byteorder::{ByteOrder, LittleEndian};
use bytes::{BufMut, Bytes, BytesMut};
use lazy_static::lazy_static;
use log::*;
use nix::poll::*;
use serde::Serialize;
use std::fs;
@@ -35,6 +34,7 @@ use std::process::{Child, ChildStderr, ChildStdin, ChildStdout, Command};
use std::sync::Mutex;
use std::time::Duration;
use std::time::Instant;
use tracing::*;
use zenith_metrics::{register_histogram, register_int_counter, Histogram, IntCounter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
@@ -42,8 +42,10 @@ use zenith_utils::nonblock::set_nonblock;
use zenith_utils::zid::ZTenantId;
use crate::config::PageServerConf;
use crate::relish::*;
use crate::repository::ZenithWalRecord;
use crate::pgdatadir_mapping::{key_to_rel_block, key_to_slru_block};
use crate::reltag::{RelTag, SlruKind};
use crate::repository::Key;
use crate::walrecord::ZenithWalRecord;
use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_bitshift;
use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_offset;
use postgres_ffi::nonrelfile_utils::mx_offset_to_member_offset;
@@ -75,8 +77,7 @@ pub trait WalRedoManager: Send + Sync {
/// the records.
fn request_redo(
&self,
rel: RelishTag,
blknum: u32,
key: Key,
lsn: Lsn,
base_img: Option<Bytes>,
records: Vec<(Lsn, ZenithWalRecord)>,
@@ -92,8 +93,7 @@ pub struct DummyRedoManager {}
impl crate::walredo::WalRedoManager for DummyRedoManager {
fn request_redo(
&self,
_rel: RelishTag,
_blknum: u32,
_key: Key,
_lsn: Lsn,
_base_img: Option<Bytes>,
_records: Vec<(Lsn, ZenithWalRecord)>,
@@ -152,28 +152,6 @@ fn can_apply_in_zenith(rec: &ZenithWalRecord) -> bool {
}
}
fn check_forknum(rel: &RelishTag, expected_forknum: u8) -> bool {
if let RelishTag::Relation(RelTag {
forknum,
spcnode: _,
dbnode: _,
relnode: _,
}) = rel
{
*forknum == expected_forknum
} else {
false
}
}
fn check_slru_segno(rel: &RelishTag, expected_slru: SlruKind, expected_segno: u32) -> bool {
if let RelishTag::Slru { slru, segno } = rel {
*slru == expected_slru && *segno == expected_segno
} else {
false
}
}
/// An error happened in WAL redo
#[derive(Debug, thiserror::Error)]
pub enum WalRedoError {
@@ -184,6 +162,8 @@ pub enum WalRedoError {
InvalidState,
#[error("cannot perform WAL redo for this request")]
InvalidRequest,
#[error("cannot perform WAL redo for this record")]
InvalidRecord,
}
///
@@ -198,8 +178,7 @@ impl WalRedoManager for PostgresRedoManager {
///
fn request_redo(
&self,
rel: RelishTag,
blknum: u32,
key: Key,
lsn: Lsn,
base_img: Option<Bytes>,
records: Vec<(Lsn, ZenithWalRecord)>,
@@ -217,11 +196,10 @@ impl WalRedoManager for PostgresRedoManager {
if rec_zenith != batch_zenith {
let result = if batch_zenith {
self.apply_batch_zenith(rel, blknum, lsn, img, &records[batch_start..i])
self.apply_batch_zenith(key, lsn, img, &records[batch_start..i])
} else {
self.apply_batch_postgres(
rel,
blknum,
key,
lsn,
img,
&records[batch_start..i],
@@ -236,11 +214,10 @@ impl WalRedoManager for PostgresRedoManager {
}
// last batch
if batch_zenith {
self.apply_batch_zenith(rel, blknum, lsn, img, &records[batch_start..])
self.apply_batch_zenith(key, lsn, img, &records[batch_start..])
} else {
self.apply_batch_postgres(
rel,
blknum,
key,
lsn,
img,
&records[batch_start..],
@@ -268,16 +245,15 @@ impl PostgresRedoManager {
///
fn apply_batch_postgres(
&self,
rel: RelishTag,
blknum: u32,
key: Key,
lsn: Lsn,
base_img: Option<Bytes>,
records: &[(Lsn, ZenithWalRecord)],
wal_redo_timeout: Duration,
) -> Result<Bytes, WalRedoError> {
let start_time = Instant::now();
let (rel, blknum) = key_to_rel_block(key).or(Err(WalRedoError::InvalidRecord))?;
let apply_result: Result<Bytes, Error>;
let start_time = Instant::now();
let mut process_guard = self.process.lock().unwrap();
let lock_time = Instant::now();
@@ -291,16 +267,11 @@ impl PostgresRedoManager {
WAL_REDO_WAIT_TIME.observe(lock_time.duration_since(start_time).as_secs_f64());
let result = if let RelishTag::Relation(rel) = rel {
// Relational WAL records are applied using wal-redo-postgres
let buf_tag = BufferTag { rel, blknum };
apply_result = process.apply_wal_records(buf_tag, base_img, records, wal_redo_timeout);
apply_result.map_err(WalRedoError::IoError)
} else {
error!("unexpected non-relation relish: {:?}", rel);
Err(WalRedoError::InvalidRequest)
};
// Relational WAL records are applied using wal-redo-postgres
let buf_tag = BufferTag { rel, blknum };
let result = process
.apply_wal_records(buf_tag, base_img, records, wal_redo_timeout)
.map_err(WalRedoError::IoError);
let end_time = Instant::now();
let duration = end_time.duration_since(lock_time);
@@ -326,8 +297,7 @@ impl PostgresRedoManager {
///
fn apply_batch_zenith(
&self,
rel: RelishTag,
blknum: u32,
key: Key,
lsn: Lsn,
base_img: Option<Bytes>,
records: &[(Lsn, ZenithWalRecord)],
@@ -346,7 +316,7 @@ impl PostgresRedoManager {
// Apply all the WAL records in the batch
for (record_lsn, record) in records.iter() {
self.apply_record_zenith(rel, blknum, &mut page, *record_lsn, record)?;
self.apply_record_zenith(key, &mut page, *record_lsn, record)?;
}
// Success!
let end_time = Instant::now();
@@ -365,8 +335,7 @@ impl PostgresRedoManager {
fn apply_record_zenith(
&self,
rel: RelishTag,
blknum: u32,
key: Key,
page: &mut BytesMut,
_record_lsn: Lsn,
record: &ZenithWalRecord,
@@ -375,16 +344,20 @@ impl PostgresRedoManager {
ZenithWalRecord::Postgres {
will_init: _,
rec: _,
} => panic!("tried to pass postgres wal record to zenith WAL redo"),
} => {
error!("tried to pass postgres wal record to zenith WAL redo");
return Err(WalRedoError::InvalidRequest);
}
ZenithWalRecord::ClearVisibilityMapFlags {
new_heap_blkno,
old_heap_blkno,
flags,
} => {
// sanity check that this is modifying the correct relish
// sanity check that this is modifying the correct relation
let (rel, blknum) = key_to_rel_block(key).or(Err(WalRedoError::InvalidRecord))?;
assert!(
check_forknum(&rel, pg_constants::VISIBILITYMAP_FORKNUM),
"ClearVisibilityMapFlags record on unexpected rel {:?}",
rel.forknum == pg_constants::VISIBILITYMAP_FORKNUM,
"ClearVisibilityMapFlags record on unexpected rel {}",
rel
);
if let Some(heap_blkno) = *new_heap_blkno {
@@ -418,6 +391,14 @@ impl PostgresRedoManager {
// Non-relational WAL records are handled here, with custom code that has the
// same effects as the corresponding Postgres WAL redo function.
ZenithWalRecord::ClogSetCommitted { xids } => {
let (slru_kind, segno, blknum) =
key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?;
assert_eq!(
slru_kind,
SlruKind::Clog,
"ClogSetCommitted record with unexpected key {}",
key
);
for &xid in xids {
let pageno = xid as u32 / pg_constants::CLOG_XACTS_PER_PAGE;
let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
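
Note on the CLOG sanity checks in this hunk: the expected segment and block are derived from the XID by two integer divisions. A worked sketch, assuming PostgreSQL's default 8 kB block size (each transaction status takes 2 bits, so 4 per byte); the constants are written out locally rather than taken from `pg_constants`, and `clog_location` is a hypothetical helper name:

```rust
const BLCKSZ: u32 = 8192; // default PostgreSQL block size (assumption)
const CLOG_XACTS_PER_BYTE: u32 = 4; // 2 status bits per transaction
const CLOG_XACTS_PER_PAGE: u32 = BLCKSZ * CLOG_XACTS_PER_BYTE; // 32768
const SLRU_PAGES_PER_SEGMENT: u32 = 32;

/// Returns (segment number, block number within the segment) of the
/// CLOG page that holds the status bits for `xid`.
fn clog_location(xid: u32) -> (u32, u32) {
    let pageno = xid / CLOG_XACTS_PER_PAGE;
    let segno = pageno / SLRU_PAGES_PER_SEGMENT;
    let blknum = pageno % SLRU_PAGES_PER_SEGMENT;
    (segno, blknum)
}
```

For example, XID 2,000,000 falls on page 61 (2,000,000 / 32,768), which is block 29 of segment 1. The assertions in the diff compare exactly these two values against the `segno` and `blknum` recovered from the key.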
@@ -425,12 +406,17 @@ impl PostgresRedoManager {
// Check that we're modifying the correct CLOG block.
assert!(
check_slru_segno(&rel, SlruKind::Clog, expected_segno),
"ClogSetCommitted record for XID {} with unexpected rel {:?}",
segno == expected_segno,
"ClogSetCommitted record for XID {} with unexpected key {}",
xid,
rel
key
);
assert!(
blknum == expected_blknum,
"ClogSetCommitted record for XID {} with unexpected key {}",
xid,
key
);
assert!(blknum == expected_blknum);
transaction_id_set_status(
xid,
@@ -440,6 +426,14 @@ impl PostgresRedoManager {
}
}
ZenithWalRecord::ClogSetAborted { xids } => {
let (slru_kind, segno, blknum) =
key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?;
assert_eq!(
slru_kind,
SlruKind::Clog,
"ClogSetAborted record with unexpected key {}",
key
);
for &xid in xids {
let pageno = xid as u32 / pg_constants::CLOG_XACTS_PER_PAGE;
let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
@@ -447,17 +441,30 @@ impl PostgresRedoManager {
// Check that we're modifying the correct CLOG block.
assert!(
check_slru_segno(&rel, SlruKind::Clog, expected_segno),
"ClogSetCommitted record for XID {} with unexpected rel {:?}",
segno == expected_segno,
"ClogSetAborted record for XID {} with unexpected key {}",
xid,
rel
key
);
assert!(
blknum == expected_blknum,
"ClogSetAborted record for XID {} with unexpected key {}",
xid,
key
);
assert!(blknum == expected_blknum);
transaction_id_set_status(xid, pg_constants::TRANSACTION_STATUS_ABORTED, page);
}
}
ZenithWalRecord::MultixactOffsetCreate { mid, moff } => {
let (slru_kind, segno, blknum) =
key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?;
assert_eq!(
slru_kind,
SlruKind::MultiXactOffsets,
"MultixactOffsetCreate record with unexpected key {}",
key
);
// Compute the block and offset to modify.
// See RecordNewMultiXact in PostgreSQL sources.
let pageno = mid / pg_constants::MULTIXACT_OFFSETS_PER_PAGE as u32;
@@ -468,16 +475,29 @@ impl PostgresRedoManager {
let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let expected_blknum = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
assert!(
check_slru_segno(&rel, SlruKind::MultiXactOffsets, expected_segno),
"MultiXactOffsetsCreate record for multi-xid {} with unexpected rel {:?}",
segno == expected_segno,
"MultiXactOffsetsCreate record for multi-xid {} with unexpected key {}",
mid,
rel
key
);
assert!(
blknum == expected_blknum,
"MultiXactOffsetsCreate record for multi-xid {} with unexpected key {}",
mid,
key
);
assert!(blknum == expected_blknum);
LittleEndian::write_u32(&mut page[offset..offset + 4], *moff);
}
ZenithWalRecord::MultixactMembersCreate { moff, members } => {
let (slru_kind, segno, blknum) =
key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?;
assert_eq!(
slru_kind,
SlruKind::MultiXactMembers,
"MultixactMembersCreate record with unexpected key {}",
key
);
for (i, member) in members.iter().enumerate() {
let offset = moff + i as u32;
@@ -492,12 +512,17 @@ impl PostgresRedoManager {
let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let expected_blknum = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
assert!(
check_slru_segno(&rel, SlruKind::MultiXactMembers, expected_segno),
"MultiXactMembersCreate record at offset {} with unexpected rel {:?}",
segno == expected_segno,
"MultiXactMembersCreate record for offset {} with unexpected key {}",
moff,
rel
key
);
assert!(
blknum == expected_blknum,
"MultiXactMembersCreate record for offset {} with unexpected key {}",
moff,
key
);
assert!(blknum == expected_blknum);
let mut flagsval = LittleEndian::read_u32(&page[flagsoff..flagsoff + 4]);
flagsval &= !(((1 << pg_constants::MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift);
@@ -541,20 +566,23 @@ impl PostgresRedoProcess {
}
info!("running initdb in {:?}", datadir.display());
let initdb = Command::new(conf.pg_bin_dir().join("initdb"))
.args(&["-D", datadir.to_str().unwrap()])
.args(&["-D", &datadir.to_string_lossy()])
.arg("-N")
.env_clear()
.env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("LD_LIBRARY_PATH", conf.pg_lib_dir())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir())
.output()
.expect("failed to execute initdb");
.map_err(|e| Error::new(e.kind(), format!("failed to execute initdb: {}", e)))?;
if !initdb.status.success() {
panic!(
"initdb failed: {}\nstderr:\n{}",
std::str::from_utf8(&initdb.stdout).unwrap(),
std::str::from_utf8(&initdb.stderr).unwrap()
);
return Err(Error::new(
ErrorKind::Other,
format!(
"initdb failed\nstdout: {}\nstderr:\n{}",
String::from_utf8_lossy(&initdb.stdout),
String::from_utf8_lossy(&initdb.stderr)
),
));
} else {
// Limit shared cache for wal-redo-postgres
let mut config = OpenOptions::new()
@@ -572,11 +600,16 @@ impl PostgresRedoProcess {
.stderr(Stdio::piped())
.stdout(Stdio::piped())
.env_clear()
.env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("LD_LIBRARY_PATH", conf.pg_lib_dir())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir())
.env("PGDATA", &datadir)
.spawn()
.expect("postgres --wal-redo command failed to start");
.map_err(|e| {
Error::new(
e.kind(),
format!("postgres --wal-redo command failed to start: {}", e),
)
})?;
info!(
"launched WAL redo postgres process on {:?}",
@@ -636,7 +669,10 @@ impl PostgresRedoProcess {
{
build_apply_record_msg(*lsn, postgres_rec, &mut writebuf);
} else {
panic!("tried to pass zenith wal record to postgres WAL redo");
return Err(Error::new(
ErrorKind::Other,
"tried to pass zenith wal record to postgres WAL redo",
));
}
}
build_get_page_msg(tag, &mut writebuf);
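
The CLOG hunks above check that the SLRU key being modified matches the segment and block derived from the XID. A minimal sketch of that arithmetic follows, assuming the constants of a default PostgreSQL build (8 KiB pages, four transaction statuses per byte, 32 pages per SLRU segment); the real code takes them from `pg_constants`. `KeyCheckError`, `clog_location` and `check_clog_key` are hypothetical names used only for illustration. Note that the commit keeps these checks as `assert!`s; returning an error, as done here, is a variation in the same spirit as the `panic!` to `Err` changes elsewhere in this file.

```rust
// Assumed values for a default PostgreSQL build; the real code reads them
// from `pg_constants`.
const CLOG_XACTS_PER_PAGE: u32 = 8192 * 4; // 8 KiB page, 4 statuses per byte
const SLRU_PAGES_PER_SEGMENT: u32 = 32;

/// Hypothetical error type standing in for the real redo error.
#[derive(Debug, PartialEq)]
enum KeyCheckError {
    WrongSegment { expected: u32, actual: u32 },
    WrongBlock { expected: u32, actual: u32 },
}

/// Returns (segment number, block number within the segment) of the CLOG
/// page that stores the status of `xid`.
fn clog_location(xid: u32) -> (u32, u32) {
    let pageno = xid / CLOG_XACTS_PER_PAGE;
    (
        pageno / SLRU_PAGES_PER_SEGMENT,
        pageno % SLRU_PAGES_PER_SEGMENT,
    )
}

/// Checks that a key's (segno, blknum) is the page that `xid` belongs to.
fn check_clog_key(xid: u32, segno: u32, blknum: u32) -> Result<(), KeyCheckError> {
    let (expected_segno, expected_blknum) = clog_location(xid);
    if segno != expected_segno {
        return Err(KeyCheckError::WrongSegment {
            expected: expected_segno,
            actual: segno,
        });
    }
    if blknum != expected_blknum {
        return Err(KeyCheckError::WrongBlock {
            expected: expected_blknum,
            actual: blknum,
        });
    }
    Ok(())
}

fn main() {
    let xid = 1_100_000;
    let (segno, blknum) = clog_location(xid);
    println!("xid {} lives in segment {}, block {}", xid, segno, blknum);
    // A key derived from the same XID always passes the check.
    assert_eq!(check_clog_key(xid, segno, blknum), Ok(()));
}
```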

View File

@@ -17,8 +17,8 @@ log = "0.4.14"
memoffset = "0.6.2"
thiserror = "1.0"
serde = { version = "1.0", features = ["derive"] }
workspace_hack = { path = "../workspace_hack" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
[build-dependencies]
bindgen = "0.59.1"

View File

@@ -24,6 +24,9 @@ pub const VISIBILITYMAP_FORKNUM: u8 = 2;
pub const INIT_FORKNUM: u8 = 3;
// From storage_xlog.h
pub const XLOG_SMGR_CREATE: u8 = 0x10;
pub const XLOG_SMGR_TRUNCATE: u8 = 0x20;
pub const SMGR_TRUNCATE_HEAP: u32 = 0x0001;
pub const SMGR_TRUNCATE_VM: u32 = 0x0002;
pub const SMGR_TRUNCATE_FSM: u32 = 0x0004;
@@ -113,7 +116,6 @@ pub const XACT_XINFO_HAS_TWOPHASE: u32 = 1u32 << 4;
// From pg_control.h and rmgrlist.h
pub const XLOG_NEXTOID: u8 = 0x30;
pub const XLOG_SWITCH: u8 = 0x40;
pub const XLOG_SMGR_TRUNCATE: u8 = 0x20;
pub const XLOG_FPI_FOR_HINT: u8 = 0xA0;
pub const XLOG_FPI: u8 = 0xB0;
pub const DB_SHUTDOWNED: u32 = 1;

View File

@@ -4,7 +4,7 @@
//! This understands the WAL page and record format, enough to figure out where the WAL record
//! boundaries are, and to reassemble WAL records that cross page boundaries.
//!
//! This functionality is needed by both the pageserver and the walkeepers. The pageserver needs
//! This functionality is needed by both the pageserver and the safekeepers. The pageserver needs
//! to look deeper into the WAL records to also understand which blocks they modify, the code
//! for that is in pageserver/src/walrecord.rs
//!

View File

@@ -495,7 +495,13 @@ mod tests {
.env("DYLD_LIBRARY_PATH", &lib_path)
.output()
.unwrap();
assert!(initdb_output.status.success());
assert!(
initdb_output.status.success(),
"initdb failed. Status: '{}', stdout: '{}', stderr: '{}'",
initdb_output.status,
String::from_utf8_lossy(&initdb_output.stdout),
String::from_utf8_lossy(&initdb_output.stderr),
);
// 2. Pick WAL generated by initdb
let wal_dir = data_dir.join("pg_wal");

View File

@@ -5,12 +5,14 @@ edition = "2021"
[dependencies]
anyhow = "1.0"
base64 = "0.13.0"
bytes = { version = "1.0.1", features = ['serde'] }
clap = "3.0"
fail = "0.5.0"
futures = "0.3.13"
hashbrown = "0.11.2"
hex = "0.4.3"
hmac = "0.10.1"
hyper = "0.14"
lazy_static = "1.4.0"
md5 = "0.7.0"
@@ -18,18 +20,25 @@ parking_lot = "0.11.2"
pin-project-lite = "0.2.7"
rand = "0.8.3"
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
rustls = "0.19.1"
routerify = "2"
rustls = "0.20.0"
rustls-pemfile = "0.2.1"
scopeguard = "1.1.0"
serde = "1"
serde_json = "1"
thiserror = "1.0"
tokio = { version = "1.11", features = ["macros"] }
sha2 = "0.9.8"
socket2 = "0.4.4"
thiserror = "1.0.30"
tokio = { version = "1.17", features = ["macros"] }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-rustls = "0.22.0"
tokio-rustls = "0.23.0"
zenith_utils = { path = "../zenith_utils" }
zenith_metrics = { path = "../zenith_metrics" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
[dev-dependencies]
tokio-postgres-rustls = "0.8.0"
async-trait = "0.1"
rcgen = "0.8.14"
rstest = "0.12"
tokio-postgres-rustls = "0.9.0"

View File

@@ -1,14 +1,24 @@
mod credentials;
#[cfg(test)]
mod flow;
use crate::compute::DatabaseInfo;
use crate::config::ProxyConfig;
use crate::cplane_api::{self, CPlaneApi};
use crate::error::UserFacingError;
use crate::stream::PqStream;
use crate::waiters;
use std::collections::HashMap;
use std::io;
use thiserror::Error;
use tokio::io::{AsyncRead, AsyncWrite};
use zenith_utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
pub use credentials::ClientCredentials;
#[cfg(test)]
pub use flow::*;
/// Common authentication error.
#[derive(Debug, Error)]
pub enum AuthErrorImpl {
@@ -16,13 +26,17 @@ pub enum AuthErrorImpl {
#[error(transparent)]
Console(#[from] cplane_api::AuthError),
#[cfg(test)]
#[error(transparent)]
Sasl(#[from] crate::sasl::Error),
/// For passwords that couldn't be processed by [`parse_password`].
#[error("Malformed password message")]
MalformedPassword,
/// Errors produced by [`PqStream`].
#[error(transparent)]
Io(#[from] std::io::Error),
Io(#[from] io::Error),
}
impl AuthErrorImpl {
@@ -67,70 +81,6 @@ impl UserFacingError for AuthError {
}
}
#[derive(Debug, Error)]
pub enum ClientCredsParseError {
#[error("Parameter `{0}` is missing in startup packet")]
MissingKey(&'static str),
}
impl UserFacingError for ClientCredsParseError {}
/// Various client credentials which we use for authentication.
#[derive(Debug, PartialEq, Eq)]
pub struct ClientCredentials {
pub user: String,
pub dbname: String,
}
impl TryFrom<HashMap<String, String>> for ClientCredentials {
type Error = ClientCredsParseError;
fn try_from(mut value: HashMap<String, String>) -> Result<Self, Self::Error> {
let mut get_param = |key| {
value
.remove(key)
.ok_or(ClientCredsParseError::MissingKey(key))
};
let user = get_param("user")?;
let db = get_param("database")?;
Ok(Self { user, dbname: db })
}
}
impl ClientCredentials {
/// Use credentials to authenticate the user.
pub async fn authenticate(
self,
config: &ProxyConfig,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
) -> Result<DatabaseInfo, AuthError> {
fail::fail_point!("proxy-authenticate", |_| {
Err(AuthError::auth_failed("failpoint triggered"))
});
use crate::config::ClientAuthMethod::*;
use crate::config::RouterConfig::*;
match &config.router_config {
Static { host, port } => handle_static(host.clone(), *port, client, self).await,
Dynamic(Mixed) => {
if self.user.ends_with("@zenith") {
handle_existing_user(config, client, self).await
} else {
handle_new_user(config, client).await
}
}
Dynamic(Password) => handle_existing_user(config, client, self).await,
Dynamic(Link) => handle_new_user(config, client).await,
}
}
}
fn new_psql_session_id() -> String {
hex::encode(rand::random::<[u8; 8]>())
}
async fn handle_static(
host: String,
port: u16,
@@ -169,7 +119,7 @@ async fn handle_existing_user(
let md5_salt = rand::random();
client
.write_message(&Be::AuthenticationMD5Password(&md5_salt))
.write_message(&Be::AuthenticationMD5Password(md5_salt))
.await?;
// Read client's password hash
@@ -213,6 +163,10 @@ async fn handle_new_user(
Ok(db_info)
}
fn new_psql_session_id() -> String {
hex::encode(rand::random::<[u8; 8]>())
}
fn parse_password(bytes: &[u8]) -> Option<&str> {
std::str::from_utf8(bytes).ok()?.strip_suffix('\0')
}
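
For reference, `parse_password` above accepts only valid UTF-8 that ends with a NUL terminator and returns the text without that terminator. The function body below is copied from the diff; the sample inputs in `main` are made up for illustration.

```rust
/// Same body as in the diff: the payload must be valid UTF-8 and must end
/// with a single trailing NUL byte, which is stripped from the result.
fn parse_password(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()?.strip_suffix('\0')
}

fn main() {
    // Well-formed payload: text followed by the NUL terminator.
    assert_eq!(parse_password(b"secret\0"), Some("secret"));
    // Missing terminator: rejected.
    assert_eq!(parse_password(b"secret"), None);
    // Invalid UTF-8: rejected even though the terminator is present.
    assert_eq!(parse_password(b"\xff\0"), None);
    // An empty password is still a valid message.
    assert_eq!(parse_password(b"\0"), Some(""));
}
```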

View File

@@ -0,0 +1,70 @@
//! User credentials used in authentication.
use super::AuthError;
use crate::compute::DatabaseInfo;
use crate::config::ProxyConfig;
use crate::error::UserFacingError;
use crate::stream::PqStream;
use std::collections::HashMap;
use thiserror::Error;
use tokio::io::{AsyncRead, AsyncWrite};
#[derive(Debug, Error)]
pub enum ClientCredsParseError {
#[error("Parameter `{0}` is missing in startup packet")]
MissingKey(&'static str),
}
impl UserFacingError for ClientCredsParseError {}
/// Various client credentials which we use for authentication.
#[derive(Debug, PartialEq, Eq)]
pub struct ClientCredentials {
pub user: String,
pub dbname: String,
}
impl TryFrom<HashMap<String, String>> for ClientCredentials {
type Error = ClientCredsParseError;
fn try_from(mut value: HashMap<String, String>) -> Result<Self, Self::Error> {
let mut get_param = |key| {
value
.remove(key)
.ok_or(ClientCredsParseError::MissingKey(key))
};
let user = get_param("user")?;
let db = get_param("database")?;
Ok(Self { user, dbname: db })
}
}
impl ClientCredentials {
/// Use credentials to authenticate the user.
pub async fn authenticate(
self,
config: &ProxyConfig,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
) -> Result<DatabaseInfo, AuthError> {
fail::fail_point!("proxy-authenticate", |_| {
Err(AuthError::auth_failed("failpoint triggered"))
});
use crate::config::ClientAuthMethod::*;
use crate::config::RouterConfig::*;
match &config.router_config {
Static { host, port } => super::handle_static(host.clone(), *port, client, self).await,
Dynamic(Mixed) => {
if self.user.ends_with("@zenith") {
super::handle_existing_user(config, client, self).await
} else {
super::handle_new_user(config, client).await
}
}
Dynamic(Password) => super::handle_existing_user(config, client, self).await,
Dynamic(Link) => super::handle_new_user(config, client).await,
}
}
}
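
The `TryFrom` conversion moved into this file can be exercised on its own. The sketch below reproduces it with the standard library only: the `thiserror` derive is replaced by a hand-written `Display` impl, and the startup parameters in `main` are invented sample values.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
enum ClientCredsParseError {
    MissingKey(&'static str),
}

// Stand-in for the `#[error(...)]` attribute used in the real code.
impl fmt::Display for ClientCredsParseError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ClientCredsParseError::MissingKey(key) => {
                write!(f, "Parameter `{}` is missing in startup packet", key)
            }
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
struct ClientCredentials {
    user: String,
    dbname: String,
}

impl TryFrom<HashMap<String, String>> for ClientCredentials {
    type Error = ClientCredsParseError;

    fn try_from(mut value: HashMap<String, String>) -> Result<Self, Self::Error> {
        // Take each parameter out of the map, or report which one is missing.
        let mut get_param = |key: &'static str| {
            value
                .remove(key)
                .ok_or(ClientCredsParseError::MissingKey(key))
        };
        let user = get_param("user")?;
        let dbname = get_param("database")?;
        Ok(Self { user, dbname })
    }
}

fn main() {
    let mut params = HashMap::new();
    params.insert("user".to_string(), "alice".to_string());
    match ClientCredentials::try_from(params) {
        Ok(creds) => println!("parsed credentials: {:?}", creds),
        Err(e) => println!("rejected startup packet: {}", e),
    }
}
```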

102
proxy/src/auth/flow.rs Normal file
View File

@@ -0,0 +1,102 @@
//! Main authentication flow.
use super::{AuthError, AuthErrorImpl};
use crate::stream::PqStream;
use crate::{sasl, scram};
use std::io;
use tokio::io::{AsyncRead, AsyncWrite};
use zenith_utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be};
/// Every authentication selector is supposed to implement this trait.
pub trait AuthMethod {
/// Any authentication selector should provide initial backend message
/// containing auth method name and parameters, e.g. md5 salt.
fn first_message(&self) -> BeMessage<'_>;
}
/// Initial state of [`AuthFlow`].
pub struct Begin;
/// Use [SCRAM](crate::scram)-based auth in [`AuthFlow`].
pub struct Scram<'a>(pub &'a scram::ServerSecret);
impl AuthMethod for Scram<'_> {
#[inline(always)]
fn first_message(&self) -> BeMessage<'_> {
Be::AuthenticationSasl(BeAuthenticationSaslMessage::Methods(scram::METHODS))
}
}
/// Use password-based auth in [`AuthFlow`].
pub struct Md5(
/// Salt for client.
pub [u8; 4],
);
impl AuthMethod for Md5 {
#[inline(always)]
fn first_message(&self) -> BeMessage<'_> {
Be::AuthenticationMD5Password(self.0)
}
}
/// This wrapper for [`PqStream`] performs client authentication.
#[must_use]
pub struct AuthFlow<'a, Stream, State> {
/// The underlying stream which implements libpq's protocol.
stream: &'a mut PqStream<Stream>,
/// State might contain ancillary data (see [`AuthFlow::begin`]).
state: State,
}
/// Initial state of the stream wrapper.
impl<'a, S: AsyncWrite + Unpin> AuthFlow<'a, S, Begin> {
/// Create a new wrapper for client authentication.
pub fn new(stream: &'a mut PqStream<S>) -> Self {
Self {
stream,
state: Begin,
}
}
/// Move to the next step by sending auth method's name & params to client.
pub async fn begin<M: AuthMethod>(self, method: M) -> io::Result<AuthFlow<'a, S, M>> {
self.stream.write_message(&method.first_message()).await?;
Ok(AuthFlow {
stream: self.stream,
state: method,
})
}
}
/// Stream wrapper for handling simple MD5 password auth.
impl<S: AsyncRead + AsyncWrite + Unpin> AuthFlow<'_, S, Md5> {
/// Perform user authentication. Raise an error in case authentication failed.
#[allow(unused)]
pub async fn authenticate(self) -> Result<(), AuthError> {
unimplemented!("MD5 auth flow is yet to be implemented");
}
}
/// Stream wrapper for handling [SCRAM](crate::scram) auth.
impl<S: AsyncRead + AsyncWrite + Unpin> AuthFlow<'_, S, Scram<'_>> {
/// Perform user authentication. Raise an error in case authentication failed.
pub async fn authenticate(self) -> Result<(), AuthError> {
// Initial client message contains the chosen auth method's name.
let msg = self.stream.read_password_message().await?;
let sasl = sasl::FirstMessage::parse(&msg).ok_or(AuthErrorImpl::MalformedPassword)?;
// Currently, the only supported SASL method is SCRAM.
if !scram::METHODS.contains(&sasl.method) {
return Err(AuthErrorImpl::auth_failed("method not supported").into());
}
let secret = self.state.0;
sasl::SaslStream::new(self.stream, sasl.message)
.authenticate(scram::Exchange::new(secret, rand::random, None))
.await?;
Ok(())
}
}
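
`AuthFlow` above follows the typestate pattern: `begin` consumes the flow in its `Begin` state and returns one parameterized by the chosen method, so method-specific operations are only available after the first message has been sent. A simplified synchronous sketch of the same idea follows. It assumes a `Vec<String>` in place of `PqStream` and plain strings in place of `BeMessage`; the `salt` accessor is hypothetical and exists only to show a state-specific method.

```rust
/// Each auth method knows the first message to send to the client.
trait AuthMethod {
    fn first_message(&self) -> String;
}

/// Initial state: no method chosen yet.
struct Begin;

/// MD5 auth state, carrying the salt sent to the client.
struct Md5([u8; 4]);

impl AuthMethod for Md5 {
    fn first_message(&self) -> String {
        format!("md5 salt={:?}", self.0)
    }
}

/// Wrapper whose available methods depend on `State`.
struct AuthFlow<'a, State> {
    out: &'a mut Vec<String>, // stand-in for the protocol stream
    state: State,
}

impl<'a> AuthFlow<'a, Begin> {
    fn new(out: &'a mut Vec<String>) -> Self {
        AuthFlow { out, state: Begin }
    }

    /// Sends the method's first message and moves to the method's state.
    fn begin<M: AuthMethod>(self, method: M) -> AuthFlow<'a, M> {
        self.out.push(method.first_message());
        AuthFlow {
            out: self.out,
            state: method,
        }
    }
}

impl<'a> AuthFlow<'a, Md5> {
    /// Only callable once the flow is in the `Md5` state.
    fn salt(&self) -> [u8; 4] {
        self.state.0
    }
}

fn main() {
    let mut out = Vec::new();
    let flow = AuthFlow::new(&mut out).begin(Md5([1, 2, 3, 4]));
    println!("salt in use: {:?}", flow.salt());
    println!("messages sent: {:?}", out);
}
```

Calling `salt` on an `AuthFlow<'_, Begin>` does not compile, which is the point of encoding the state in the type.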

View File

@@ -24,7 +24,7 @@ pub enum ConnectionError {
impl UserFacingError for ConnectionError {}
/// Compute node connection params.
#[derive(Serialize, Deserialize, Debug, Default)]
#[derive(Serialize, Deserialize, Default)]
pub struct DatabaseInfo {
pub host: String,
pub port: u16,
@@ -33,6 +33,16 @@ pub struct DatabaseInfo {
pub password: Option<String>,
}
// Manually implement debug to omit personal and sensitive info
impl std::fmt::Debug for DatabaseInfo {
fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result {
fmt.debug_struct("DatabaseInfo")
.field("host", &self.host)
.field("port", &self.port)
.finish()
}
}
/// PostgreSQL version as [`String`].
pub type Version = String;
@@ -41,6 +51,7 @@ impl DatabaseInfo {
let host_port = format!("{}:{}", self.host, self.port);
let socket = TcpStream::connect(host_port).await?;
let socket_addr = socket.peer_addr()?;
socket2::SockRef::from(&socket).set_keepalive(true)?;
Ok((socket_addr, socket))
}
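
The manual `Debug` impl above keeps sensitive fields out of log output by listing only `host` and `port`. A minimal sketch of the effect, using a trimmed-down struct (the real `DatabaseInfo` has more fields than the ones shown here) and made-up values:

```rust
use std::fmt;

/// Trimmed-down stand-in for the real struct.
#[allow(dead_code)]
struct DatabaseInfo {
    host: String,
    port: u16,
    password: Option<String>,
}

// Only non-sensitive fields are listed, so `{:?}` never prints the password.
impl fmt::Debug for DatabaseInfo {
    fn fmt(&self, fmt: &mut fmt::Formatter) -> fmt::Result {
        fmt.debug_struct("DatabaseInfo")
            .field("host", &self.host)
            .field("port", &self.port)
            .finish()
    }
}

fn main() {
    let info = DatabaseInfo {
        host: "localhost".to_string(),
        port: 5432,
        password: Some("hunter2".to_string()),
    };
    let rendered = format!("{:?}", info);
    println!("{}", rendered);
    assert!(!rendered.contains("hunter2"));
}
```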

View File

@@ -1,10 +1,9 @@
use anyhow::{anyhow, bail, ensure, Context};
use rustls::{internal::pemfile, NoClientAuth, ProtocolVersion, ServerConfig};
use anyhow::{bail, ensure, Context};
use std::net::SocketAddr;
use std::str::FromStr;
use std::sync::Arc;
pub type TlsConfig = Arc<ServerConfig>;
pub type TlsConfig = Arc<rustls::ServerConfig>;
#[non_exhaustive]
pub enum ClientAuthMethod {
@@ -61,21 +60,28 @@ pub struct ProxyConfig {
pub fn configure_ssl(key_path: &str, cert_path: &str) -> anyhow::Result<TlsConfig> {
let key = {
let key_bytes = std::fs::read(key_path).context("SSL key file")?;
let mut keys = pemfile::pkcs8_private_keys(&mut &key_bytes[..])
.map_err(|_| anyhow!("couldn't read TLS keys"))?;
let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..])
.context("couldn't read TLS keys")?;
ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len());
keys.pop().unwrap()
keys.pop().map(rustls::PrivateKey).unwrap()
};
let cert_chain = {
let cert_chain_bytes = std::fs::read(cert_path).context("SSL cert file")?;
pemfile::certs(&mut &cert_chain_bytes[..])
.map_err(|_| anyhow!("couldn't read TLS certificates"))?
rustls_pemfile::certs(&mut &cert_chain_bytes[..])
.context("couldn't read TLS certificate chain")?
.into_iter()
.map(rustls::Certificate)
.collect()
};
let mut config = ServerConfig::new(NoClientAuth::new());
config.set_single_cert(cert_chain, key)?;
config.versions = vec![ProtocolVersion::TLSv1_3];
let config = rustls::ServerConfig::builder()
.with_safe_default_cipher_suites()
.with_safe_default_kx_groups()
.with_protocol_versions(&[&rustls::version::TLS13])?
.with_no_client_auth()
.with_single_cert(cert_chain, key)?;
Ok(config.into())
}

View File

@@ -1,19 +1,8 @@
///
/// Postgres protocol proxy/router.
///
/// This service listens psql port and can check auth via external service
/// (control plane API in our case) and can create new databases and accounts
/// in somewhat transparent manner (again via communication with control plane API).
///
use anyhow::{bail, Context};
use clap::{App, Arg};
use config::ProxyConfig;
use futures::FutureExt;
use std::future::Future;
use tokio::{net::TcpListener, task::JoinError};
use zenith_utils::GIT_VERSION;
use crate::config::{ClientAuthMethod, RouterConfig};
//! Postgres protocol proxy/router.
//!
//! This service listens on the psql port, can check auth via an external service
//! (the control plane API in our case), and can create new databases and accounts
//! in a somewhat transparent manner (again by communicating with the control plane API).
mod auth;
mod cancellation;
@@ -27,6 +16,24 @@ mod proxy;
mod stream;
mod waiters;
// Currently SCRAM is only used in tests
#[cfg(test)]
mod parse;
#[cfg(test)]
mod sasl;
#[cfg(test)]
mod scram;
use anyhow::{bail, Context};
use clap::{App, Arg};
use config::ProxyConfig;
use futures::FutureExt;
use std::future::Future;
use tokio::{net::TcpListener, task::JoinError};
use zenith_utils::GIT_VERSION;
use crate::config::{ClientAuthMethod, RouterConfig};
/// Flattens `Result<Result<T>>` into `Result<T>`.
async fn flatten_err(
f: impl Future<Output = Result<anyhow::Result<()>, JoinError>>,

Some files were not shown because too many files have changed in this diff