Compare commits

...

180 Commits

Author SHA1 Message Date
Konstantin Knizhnik
0a049ae17a Not working version 2022-05-05 08:53:21 +03:00
Konstantin Knizhnik
32557b16b4 Use prepared dictionary for layer reconstruction 2022-05-04 18:17:33 +03:00
Konstantin Knizhnik
076b8e3d04 Use zstd::bulk::Decompressor::decompress instead of decompress_to_buffer 2022-05-03 11:28:32 +03:00
Konstantin Knizhnik
39eadf6236 Use zstd::bulk::Decompressor to decode WAL records, to minimize the number of context initializations 2022-05-03 09:59:33 +03:00
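These zstd commits reuse one compression/decompression context (and a trained dictionary) across many WAL records instead of creating a fresh context per record. Below is a minimal illustrative sketch of that pattern with the zstd crate's bulk API; the function, dictionary size, and compression level are assumptions, not the pageserver's actual code.

```rust
use std::io;

/// Train a dictionary from sample WAL records, then reuse a single
/// compressor and decompressor context for many records instead of
/// initializing a new zstd context per record.
fn roundtrip_records(samples: &[Vec<u8>], records: &[Vec<u8>]) -> io::Result<()> {
    // A 16 KiB dictionary is an arbitrary illustrative choice.
    let dict = zstd::dict::from_samples(samples, 16 * 1024)?;

    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;

    for rec in records {
        let compressed = compressor.compress(rec)?;
        // `decompress` takes a capacity hint for the output buffer.
        let restored = decompressor.decompress(&compressed, rec.len())?;
        assert_eq!(&restored, rec);
    }
    Ok(())
}
```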
Heikki Linnakangas
4472d49c1e Reuse the zstd Compressor context when building delta layer. 2022-05-03 01:47:39 +03:00
Konstantin Knizhnik
dc057ace2f Fix formatting 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
0e49d748b8 Fix bug in dictionary creation 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
fc7d1ba043 Do not compress delta layers if there are too few elements 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
e28b3dee37 Implement compression of image and delta layers 2022-05-02 07:58:07 +03:00
Dhammika Pathirana
992874c916 Fix update ps settings doc
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:52:08 -07:00
Dhammika Pathirana
3128e8c75c Fix tenant conf test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:25 -07:00
Dhammika Pathirana
f3f12db2cb Add gc churn threshold knob (#1594)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:17 -07:00
Andrey Taranik
038ea4c128 proxy notice message update (#1600) 2022-04-30 22:04:08 +03:00
Kirill Bulatov
7e1db8c8a1 Show which virtual file got the deserialization errors 2022-04-29 21:40:57 +03:00
Andrey Taranik
aa933d3961 proxy settings update for new domain (#1597) 2022-04-29 20:05:14 +03:00
Dmitry Rodionov
67b4e38092 temporarily disable test_backpressure_received_lsn_lag 2022-04-29 15:53:56 +03:00
Dmitry Rodionov
05f8e6a050 Use fsync+rename for atomic downloads from remote storage
Use failpoint in test_remote_storage to check the behavior
2022-04-29 15:53:56 +03:00
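The fsync+rename pattern mentioned above is the standard way to make a downloaded file appear atomically: write to a temporary name, flush it, rename it into place, then fsync the parent directory. A minimal sketch using only the standard library; the ".temp" suffix is illustrative and does not reflect the actual remote storage code.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Write the downloaded bytes to a temporary file, flush them to disk,
/// atomically rename the file into place, then fsync the parent directory
/// so the rename itself survives a crash.
fn atomic_download(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = dest.with_extension("temp"); // illustrative temp name

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // fsync the file contents before renaming

    fs::rename(&tmp, dest)?; // atomic replacement on POSIX filesystems

    if let Some(parent) = dest.parent() {
        File::open(parent)?.sync_all()?; // fsync the directory entry
    }
    Ok(())
}
```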
chaitanya sharma
76388abeb6 Rename READMEs with .md extension, and fix links to them.
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.

Rename libs/postgres_ffi/README and safekeeper/README files to also
have the .md extension, so that GitHub can render them nicely.

Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
2022-04-29 14:23:42 +03:00
Kirill Bulatov
2911eb084a Remove timeline files on detach 2022-04-29 09:19:18 +03:00
Kirill Bulatov
6cca57f95a Properly remove from the local timeline map 2022-04-29 09:19:18 +03:00
Kirill Bulatov
4a46b01caf Properly populate local timeline map 2022-04-29 09:19:18 +03:00
Anastasia Lubennikova
5c5c3c64f3 Fix tenant config parsing. Add a test 2022-04-28 11:49:19 +03:00
Arthur Petukhovsky
29539b0561 Set wal_keep_size to zero (#1507)
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
2022-04-27 19:09:28 +03:00
Dmitry Rodionov
695b5f9d88 Remove obsolete failpoint in proxy
When the failpoint feature is disabled, the macro throws away the code passed
to it, so that code is not guaranteed to compile with the feature disabled. In
this particular case the code is obsolete, so remove it.
2022-04-27 14:34:33 +03:00
Dhammika Pathirana
66694e736a Fix add ps tenant config
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
091cefaa92 Fix add compaction for key partitioning
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
aeb4f81c3b Add branch traversal unit test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
6391862d8a Add branch traversal test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
b2e35fffa6 Fix ancestor layer traversal (#1484)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Arseny Sher
8b9d523f3c Remove old WAL on safekeepers.
Remove WAL once it has been consumed by all of: 1) the pageserver
(remote_consistent_lsn), 2) safekeeper peers, 3) S3 WAL offloading.

In the test, S3 offloading is for now mocked by directly bumping s3_wal_lsn.

ref #1403
2022-04-26 23:02:23 +04:00
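In other words, the removal horizon is the minimum of the three LSNs listed above; WAL older than that is safe to delete. A tiny sketch of that computation (the Lsn alias and function name are illustrative stand-ins, not the safekeeper's real types):

```rust
/// `Lsn` is a stand-in alias; the safekeeper has its own Lsn type.
type Lsn = u64;

/// WAL up to this LSN has been consumed by the pageserver, every
/// safekeeper peer, and S3 offloading, so it can be removed.
fn wal_removal_horizon(remote_consistent_lsn: Lsn, peer_lsns: &[Lsn], s3_wal_lsn: Lsn) -> Lsn {
    let min_peer = peer_lsns.iter().copied().min().unwrap_or(0);
    remote_consistent_lsn.min(min_peer).min(s3_wal_lsn)
}
```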
Arseny Sher
3fd234da07 Enable etcd for safekeepers in deploy. 2022-04-26 18:13:50 +04:00
Kirill Bulatov
778744d35c Limit concurrent S3 and IAM interactions 2022-04-26 13:49:37 +03:00
Dmitry Rodionov
eabf6f89e4 Use item.get for tenant config toml parsing
Previously we used the table interface, but there was no easy way to pass
it as an override to the pageserver through the CLI. Use the same strategy as
for the remote storage config parsing.
2022-04-26 10:15:19 +03:00
Kirill Bulatov
fec050ce97 Fix macos clippy issues 2022-04-25 16:23:34 +03:00
Kirill Bulatov
d060a97c54 Simplify clippy runs 2022-04-25 16:23:34 +03:00
Anastasia Lubennikova
78a6cb247f Allow users to create extensions: GRANT CREATE ON DATABASE 2022-04-25 15:35:44 +03:00
Kirill Bulatov
8f6a161271 Show better layer load errors 2022-04-25 14:54:39 +03:00
Andrey Taranik
56f6269a8e rename docker images to neondatabase docker account (#1570)
* rename docker images to neondatabase docker account

* docker images build fix (permissions for Cargo.lock)
2022-04-25 11:34:51 +03:00
Heikki Linnakangas
1fb3d08185 Use a 1-byte length header for short blobs.
Notably, this shaves 3 bytes from each small WAL record stored in
ephemeral or delta layers.
2022-04-22 21:31:27 +03:00
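The commit message doesn't spell out the exact encoding, but a plausible sketch of a variable-length blob header is: lengths below 128 take a single byte, while longer blobs fall back to a 4-byte big-endian header with the top bit set. The code below only illustrates that idea and is not the actual on-disk format.

```rust
/// Lengths below 0x80 fit in one byte; longer blobs use a 4-byte
/// big-endian header with the top bit set. Purely illustrative.
fn write_len_header(buf: &mut Vec<u8>, len: usize) {
    if len < 0x80 {
        buf.push(len as u8);
    } else {
        assert!(len <= 0x7fff_ffff, "blob too large for this encoding");
        buf.extend_from_slice(&(0x8000_0000u32 | len as u32).to_be_bytes());
    }
}

/// Returns (blob length, number of header bytes consumed).
fn read_len_header(buf: &[u8]) -> (usize, usize) {
    if buf[0] & 0x80 == 0 {
        (buf[0] as usize, 1)
    } else {
        let mut hdr = [0u8; 4];
        hdr.copy_from_slice(&buf[..4]);
        ((u32::from_be_bytes(hdr) & 0x7fff_ffff) as usize, 4)
    }
}
```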
bojanserafimov
867aede715 Add idle compute restart time test (#1514) 2022-04-22 10:45:47 -04:00
Dmitry Ivanov
d3f356e7a8 Update rust-postgres project-wide (#1525)
* Update `rust-postgres` project-wide

This commit points to https://github.com/neondatabase/rust-postgres/commits/neon
in order to test our patches on top of the latest version of this crate.

* [proxy] Update `hmac` and `sha2`
2022-04-22 17:31:58 +03:00
Konstantin Knizhnik
5f83c9290b Make it possible to specify per-tenant configuration parameters
Add tenant config API and 'zenith tenant config' CLI command.
Add a 'show' query to the pageserver protocol for tenant-specific config parameters

Refactoring: move the tenant_config code to a separate module.
Save the tenant conf file to the tenant's directory when the tenant is created, so it can be recovered on pageserver restart.
Ignore errors during tenant config loading while it is not yet supported by the console.

Define PiTR interval for GC.

refer #1320
2022-04-22 11:24:29 +03:00
Heikki Linnakangas
a4700c9bbe Use pprof to get flamegraph of get_page and get_relsize requests.
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.

The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.

Add a performance test that runs the perf_pgbench test, with profiling
enabled.
2022-04-21 20:32:48 +03:00
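For reference, this is roughly how the pprof-rs crate is wired up: start a sampling ProfilerGuard, run the workload, and render a flamegraph on exit. The real change uses a patched pprof-rs behind the optional 'profiling' feature, so treat this as a generic sketch rather than the pageserver's implementation.

```rust
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample the process roughly 100 times per second.
    let guard = pprof::ProfilerGuard::new(100)?;

    serve_requests(); // placeholder for the pageserver's request loop

    // On shutdown, render the collected samples as an SVG flamegraph.
    if let Ok(report) = guard.report().build() {
        let file = File::create("flamegraph.svg")?;
        report.flamegraph(file)?;
    }
    Ok(())
}

fn serve_requests() {}
```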
Heikki Linnakangas
dafdf9b952 Handle EINTR 2022-04-21 16:37:36 +03:00
Heikki Linnakangas
263d60f12d Add prometheus metric for time spent waiting for WAL to arrive 2022-04-21 16:37:32 +03:00
Arseny Sher
abcd7a4b1f Insert less data in test_wal_restore.
Otherwise it sometimes hits 2m statement timeout in CI.
2022-04-21 16:00:15 +04:00
Kirill Bulatov
81cad6277a Move library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00
Kirill Bulatov
629688fd6c Drop redundant resolver setting for 2021 edition 2022-04-21 13:30:33 +03:00
Heikki Linnakangas
9d3779c124 Add a counter for materialized page cache hits. 2022-04-20 21:26:03 +03:00
Heikki Linnakangas
334a1d6b5d Fix materialized page caching with delta layers.
We only checked the cached page version when collecting WAL records in
an in-memory layer, not in a delta layer. Refactor the code so that we
always stop collecting WAL records when we reach a cached materialized
page.

Fix the assertion on the LSN range in
InMemoryLayer::get_value_reconstruct_data. It was supposed to check
that the requested LSN range is within the layer's LSN range, but the
inequality was backwards. That went unnoticed before, because the
caller always passed the layer's start LSN as the requested LSN
range's start LSN, but now we might stop the search earlier, if we have
a cached page version.

Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
2022-04-20 21:25:59 +03:00
Dmitry Rodionov
e41ad3be0f add more context to writeback error 2022-04-20 17:07:07 +03:00
Heikki Linnakangas
e113c6fa8d Print a warning if unlinking an ephemeral file fails.
An unlink failure isn't serious on its own (we were about to remove the
file anyway), but it shouldn't happen and could be a symptom of
something more serious.

We just saw "No such file or directory" errors happening from
ephemeral file writeback in staging, and I suspect if we had this
warning in place, we would have seen these warnings too, if the
problem was that the ephemeral file was removed before dropping the
EphemeralFile struct. Next time it happens, we'll have more
information.
2022-04-20 16:23:16 +03:00
Heikki Linnakangas
cbdfd8c719 Update 'routerify' dependency in proxy.
Routerify version 3 is used in zenith_utils; use the same version in proxy
to avoid having to build two versions.
2022-04-20 14:42:05 +03:00
Heikki Linnakangas
86bf4301b7 Remove unnecessary dependency on 'webpki' 2022-04-20 14:36:54 +03:00
Heikki Linnakangas
9eaa21317c Update jsonwebtoken crate.
With this, we no longer need to build two versions of the 'pem' and 'base64'
crates. It introduces a duplicate version of the 'time' crate, though, but it's
still progress.
2022-04-20 14:27:49 +03:00
Heikki Linnakangas
e660e12f79 Update rustls-split and rustls versions.
All dependencies now use rustls 0.20.2, so we no longer need to build two
versions of it.
2022-04-20 14:07:55 +03:00
Konstantin Knizhnik
ac52f4f2d6 Set superuser when initializing database for wal recovery (#1544) 2022-04-20 13:24:38 +03:00
Heikki Linnakangas
5e95338ee9 Improve logging in test_wal_restore.py
- Capture the output of the restore_from_wal.sh in a log file
- Kill "restored" Postgres server on test failure
2022-04-20 11:18:40 +03:00
Heikki Linnakangas
170badd626 Capture the postgres log in all tests that start a vanilla Postgres. 2022-04-20 11:18:40 +03:00
Kirill Bulatov
91fb21225a Show more logs during S3 sync 2022-04-20 02:57:03 +03:00
Kirill Bulatov
3e6087a12f Remove S3 archiving 2022-04-19 23:13:52 +03:00
Kirill Bulatov
44bfc529f6 Require specifying the upload size in remote storage 2022-04-19 23:13:52 +03:00
bojanserafimov
ef72eb84cf Remove zenfixture (#1534) 2022-04-19 09:46:47 -04:00
Kirill Bulatov
a1e34772e5 Improve compute error logging 2022-04-19 00:20:08 +03:00
Stas Kelvich
389bd1faeb Support for SCRAM-SHA-256 in compute tools 2022-04-18 22:19:01 +03:00
Anastasia Lubennikova
c15aa04714 Move Cluster size limit RFC from rfcs repo 2022-04-18 18:11:31 +03:00
Kirill Bulatov
52e0816fa5 wal_acceptor -> safekeeper 2022-04-18 12:52:31 +03:00
Kirill Bulatov
81417788c8 walkeeper -> safekeeper 2022-04-18 12:52:31 +03:00
Kirill Bulatov
81879f8137 Restore missing cachepot env vars 2022-04-18 12:32:04 +03:00
Arseny Sher
5b29774532 Small refactoring after ec3bc74165.
Move record_safekeeper_info inside safekeeper.rs, fix commit_lsn update, sync
control file.
2022-04-18 13:11:34 +04:00
Kirill Bulatov
0ca2bd929b Remove log crate from pageserver 2022-04-18 00:00:36 +03:00
Kirill Bulatov
9b7dcc2bae Use proper cachepot bucket 2022-04-17 16:35:40 +03:00
Kirill Bulatov
3136a0754a Use mold in Docker images 2022-04-17 00:50:28 +03:00
Kirill Bulatov
787f0d33f0 Use another cachepot bucket for rust Docker build caches 2022-04-16 23:36:42 +03:00
Kirill Bulatov
ed5f9acca9 Revert "Revert libc upgrade" (#1527)
This reverts commit 4bc338babc.
2022-04-16 13:38:48 +03:00
Kirill Bulatov
4bc338babc Revert libc upgrade 2022-04-16 10:03:26 +03:00
Kirill Bulatov
3ab090b43a Fix compute tools build 2022-04-15 23:12:35 +03:00
Kirill Bulatov
7126979950 Remove custom neon Docker build image 2022-04-15 20:08:22 +03:00
Arseny Sher
9946cd1125 Bump vendor/postgres to add safekeeper connection timeout. 2022-04-15 20:44:56 +04:00
Dmitry Ivanov
ab20f2c491 Use the same version of rust-postgres everywhere. (#1516)
Turns out we still had a stale dep in `compute_tools`.
2022-04-15 18:36:11 +03:00
Dmitry Ivanov
c9d897f9b6 [proxy] Update rustls (#1510) 2022-04-15 12:06:25 +03:00
Kirill Bulatov
e97f94cc30 Bump rustc version 2022-04-14 23:01:06 +03:00
Dmitry Rodionov
2cb39a1624 add missing files, update workspace hack 2022-04-14 20:41:21 +03:00
Heikki Linnakangas
93e0ac2b7a Remove a couple of unused dependencies.
Found by "cargo-udeps"
2022-04-14 17:38:26 +03:00
bojanserafimov
d5ae9db997 Add s3 cost estimate to tests (#1478) 2022-04-14 10:09:03 -04:00
Heikki Linnakangas
9e4de6bed0 Use RwLock instead of Mutex for the layer map lock.
For more concurrency
2022-04-14 13:34:01 +03:00
Heikki Linnakangas
4a8c663452 Refactor pgbench tests.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
  works, without actually recording any performance numbers, but that
  doesn't seem very interesting anymore. Remove it to avoid confusing it
  with the actual pgbench benchmarks

- Run pgbench with "-n" and "-S" options, for two different workloads:
  simple-updates, and SELECT-only. Previously, we would only run it with
  the "default" TPCB-like workload. That's more or less the same as the
  simple-update (-n) workload, but I think the simple-update workload
  is more relevant for testing storage performance. The SELECT-only
  workload is a new thing to measure.

- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
  a new "remote" implementation of the PgCompare class, which allows
  running the same tests against an already-running Postgres instance.

- Make the PgBenchRunResult.parse_from_output function more
  flexible. pgbench can print different lines depending on the
  command-line options, but the parsing function expected a particular
  set of lines.
2022-04-14 13:31:42 +03:00
Heikki Linnakangas
a009fe912a Refactor connection option handling in python tests
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability to override them in the connect() call.
2022-04-14 13:31:40 +03:00
Heikki Linnakangas
19954dfd8a Refactor proxy options test to not rely on the 'schema' argument.
It was the only test that used the 'schema' argument to the connect()
function. I'm about to refactor the option handling and will remove
the special 'schema' argument altogether, so rewrite the test to not
use it.
2022-04-14 13:31:37 +03:00
Heikki Linnakangas
570db6f168 Update README for Zenith -> Neon renaming.
There's a lot of renaming left to do in the code and docs, but this is
a start. Our binaries and many other things are still called "zenith",
but I didn't change those in the README, because otherwise the
examples won't work. I added a brief note at the top of the README to
explain that we're in the process of renaming, until we've renamed
everything.
2022-04-14 11:30:01 +03:00
Arthur Petukhovsky
cdf04b6a9f Fix control file updates in safekeeper (#1452)
Now control_file::Storage implements Deref for read-only access to the state. All updates should clone the state before modifying and persisting.
2022-04-14 09:31:35 +03:00
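A minimal sketch of the clone-modify-persist pattern this commit describes, with stand-in types (the real SafeKeeperState and control_file::Storage are richer):

```rust
use std::ops::Deref;

/// Stand-in for the persisted safekeeper state.
#[derive(Clone)]
struct SafeKeeperState {
    commit_lsn: u64,
}

/// Stand-in for control_file::Storage.
struct Storage {
    state: SafeKeeperState,
}

impl Deref for Storage {
    type Target = SafeKeeperState;
    fn deref(&self) -> &SafeKeeperState {
        &self.state // read-only access to the current state
    }
}

impl Storage {
    /// Updates go through clone-modify-persist: the caller hands in a
    /// complete new state, which is written out and then swapped in.
    fn persist(&mut self, new_state: SafeKeeperState) {
        // ... write new_state to the control file and fsync it ...
        self.state = new_state;
    }
}

fn bump_commit_lsn(storage: &mut Storage, lsn: u64) {
    // Reads use Deref, e.g. `storage.commit_lsn`.
    let mut new_state = (**storage).clone();
    new_state.commit_lsn = lsn.max(storage.commit_lsn);
    storage.persist(new_state);
}
```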
Dhammika Pathirana
a0781f229c Add ps compact command
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add ps compact command to api (#707) (#1484)
2022-04-13 22:47:13 -07:00
Dmitry Rodionov
1d36c5a39e re-enable s3 on staging pageservers by default
After the deadlock fix in https://github.com/neondatabase/neon/pull/1496 s3
seems to work normally. There is one more discovered issue, but it is not
a blocker, so it can be fixed separately.
2022-04-13 20:10:39 +03:00
Dmitry Rodionov
49da76237b remove noisy debug log message 2022-04-13 19:50:31 +03:00
Dhammika Pathirana
1fd08107ca Add ps compaction_threshold config
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>

Add ps compaction_threshold knob for (#707) (#1484)
2022-04-13 07:42:58 -07:00
Daniil
58d5136a61 compute_tools: check writability handler (#941) 2022-04-13 17:16:25 +03:00
Arthur Petukhovsky
87020f8126 Fix CI staging deploy (#1499)
- Remove stopped safekeeper from inventory
- Fix github pages address after neon rename
2022-04-13 10:59:29 +03:00
Dmitry Rodionov
20414c4b16 defuse possible deadlock in download_timeline too 2022-04-13 10:05:19 +03:00
Dmitry Rodionov
9b7a8e67a4 fix deadlock in upload_timeline_checkpoint
It originated from the fact that we were calling fetch_full_index
without releasing the read guard, and fetch_full_index tries to acquire
the read lock again. For a plain mutex this is already a deadlock; for an RW
lock the deadlock was triggered by an attempt to acquire write access later in
the code while still holding an active read guard up the stack.

This is sort of a band-aid, because Kirill plans to change this code
during removal of the archiving mechanism.
2022-04-13 10:05:19 +03:00
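The bug shape described above is a lock re-acquired further down the call stack while a guard is still held. A stripped-down illustration with std::sync::RwLock (types and names are placeholders; this shows the pattern, not the original code):

```rust
use std::sync::RwLock;

struct Index {
    entries: Vec<String>,
}

/// Helper that takes the read lock itself (the second acquisition).
fn fetch_full_index(index: &RwLock<Index>) -> usize {
    index.read().unwrap().entries.len()
}

fn upload_checkpoint(index: &RwLock<Index>) {
    // First read acquisition, held across the helper call below.
    let guard = index.read().unwrap();

    // BUG: re-acquiring the same lock while `guard` is still alive.
    // With a Mutex this self-deadlocks immediately; with an RwLock it
    // hangs as soon as a writer queues up between the two reads.
    let n = fetch_full_index(index);

    println!("{} entries, first: {:?}", n, guard.entries.first());
}
```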
Dmitry Ivanov
4af87f3d60 [proxy] Add SCRAM auth mechanism implementation (#1050)
* [proxy] Add SCRAM auth

* [proxy] Implement some tests for SCRAM

* Refactoring + test fixes

* Hide SCRAM mechanism behind `#[cfg(test)]`

Currently we only use it in tests, so we hide all the relevant
modules behind `#[cfg(test)]` to prevent "unused item" warnings.
2022-04-13 03:00:32 +03:00
Alexey Kondratov
0fbe657b2f Fix remote e2e tests after repository rename (#1434)
Also start them after the release build instead of the debug build. This saves
3-5 minutes, and we use release mode in Docker images anyway.
2022-04-13 00:02:06 +03:00
Konstantin Knizhnik
07a9553700 Add test for restore from WAL (#1366)
* Add test for restore from WAL

* Fix python formatting

* Choose unused port in wal restore test

* Move recovery tests to zenith_utils/scripts

* Set LD_LIBRARY_PATH in wal recovery scripts

* Fix python test formatting

* Fix mypy warning

* Bump postgres version

* Bump postgres version
2022-04-11 22:30:08 +03:00
Kirill Bulatov
dc7e3ff05a Fix rustc 1.60 clippy warnings 2022-04-11 21:34:04 +03:00
Kirill Bulatov
4f172e7612 Replicate S3 blob metadata in the remote storage 2022-04-11 21:34:04 +03:00
Kirill Bulatov
0e9ee772af Use rusoto in safekeeper 2022-04-11 21:34:04 +03:00
Kirill Bulatov
db63fa64ae Use rusoto lib for S3 relish_storage impl 2022-04-11 21:34:04 +03:00
Arthur Petukhovsky
8e2a6661e9 Make wal_storage initialization eager (#1489) 2022-04-11 20:36:26 +03:00
Heikki Linnakangas
214567bf8f Use B-tree for the index in image and delta layers.
We now use a page cache for those, instead of slurping the whole index into
memory.

Fixes https://github.com/zenithdb/zenith/issues/1356

This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
2022-04-07 20:58:55 +03:00
Heikki Linnakangas
c4b57e4b8f Move BlobRef
It's not needed in image layers anymore, so move it into delta_layer.rs
2022-04-07 20:58:55 +03:00
Heikki Linnakangas
5d9851f5d1 Refactor the I/O functions.
This introduces two new abstraction layers for I/O:

- Block I/O, and
- Blob I/O.

The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.

There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.

The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.

Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.

This is a backwards-incompatible change to the storage format.
2022-04-07 20:58:54 +03:00
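A compact sketch of the two abstraction layers this commit introduces, with deliberately simplified signatures (the real BlockReader returns cache-backed page references, and the blob writer sits on top of block storage rather than a Vec):

```rust
use std::io;

const PAGE_SZ: usize = 8192;

/// Anything that can serve fixed-size 8 KiB pages: an ephemeral file, or
/// an arbitrary VirtualFile read through the page cache.
trait BlockReader {
    fn read_blk(&self, blknum: u32) -> io::Result<[u8; PAGE_SZ]>;
}

/// Blob layer on top of block storage: store a length header plus payload,
/// return the starting offset so the blob can be fetched again later.
struct BlobWriter {
    buf: Vec<u8>, // stand-in for the underlying block storage
}

impl BlobWriter {
    fn write_blob(&mut self, data: &[u8]) -> u64 {
        let offset = self.buf.len() as u64;
        self.buf.extend_from_slice(&(data.len() as u32).to_be_bytes());
        self.buf.extend_from_slice(data);
        offset
    }
}
```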
Arthur Petukhovsky
81ba23094e Fix scripts to deploy sk4 on staging (#1476)
Adjust ansible scripts and inventory for sk4 on staging
2022-04-07 20:38:26 +03:00
bojanserafimov
d5258cdc4d [proxy] Don't print passwords (#1298) 2022-04-06 20:05:24 -04:00
Arthur Petukhovsky
6bc78a0e77 Log more info in test_many_timelines asserts (#1473)
It will help to debug #1470 as soon as it happens again
2022-04-07 01:44:26 +03:00
bojanserafimov
6fe443e239 Improve random_writes test (#1469)
If you want to test with a 3 GB database by tweaking some constants, you'll hit a query timeout. I fix that by batching the inserts.
2022-04-06 18:32:10 -04:00
Alexey Kondratov
d0c246ac3c Update pageserver OpenAPI spec with missing attach/detach methods (#1463)
We have had these methods in the API for some time, so mentioning them in the
spec could be useful for the console (see zenithdb/console#867), as we generate
the pageserver HTTP API Golang client there.
2022-04-05 20:01:57 +03:00
Heikki Linnakangas
2f784144fe Avoid deadlock when locking two buffers.
It happened in unit tests. If a thread tries to read a buffer while
already holding a lock on one buffer, the code to find a victim buffer
to evict could try to evict the buffer that's already locked. To fix,
skip locked buffers.
2022-04-04 20:12:31 +03:00
Heikki Linnakangas
222b723354 Handle read errors when dumping a delta layer file.
If a file is corrupt, let's not stop on the first read error, but continue
dumping.
2022-04-04 20:12:28 +03:00
Heikki Linnakangas
089ba6abfe Clean up some comments that still referred to 'segments' 2022-04-04 20:12:25 +03:00
Arthur Petukhovsky
a5a478c321 Bump vendor/postgres to store WAL on disk only (#1342)
Now WAL is no longer held in compute memory
2022-04-04 16:32:30 +03:00
Konstantin Knizhnik
fcf613b6e3 Fix unit tests build 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
572b3f48cf Add compaction_target_size parameter 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
bef9b837f1 Replace rwlock with mutex in repartition 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
232fe14297 Refactor partitioning 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
92031d376a Fix unit tests 2022-04-04 10:43:27 +03:00
Konstantin Knizhnik
1f0b406b63 Perform repartitioning in compaction thread
refer #1441
2022-04-04 10:43:27 +03:00
Kirill Bulatov
4c9447589a Place an info span into gc loop step 2022-04-03 19:30:36 +03:00
Kirill Bulatov
9e5423c867 Assert in a more informative way 2022-04-03 19:30:36 +03:00
Kirill Bulatov
43c16c5145 Don't log ZIds in the timeline load span 2022-04-03 19:30:36 +03:00
bojanserafimov
af712798e7 Fix pageserver readme formatting
I put the diagram in a fixed-width block, since it wasn't rendering correctly on github.
2022-04-02 00:36:54 +03:00
Dmitry Ivanov
f5da652388 [proxy] Enable keepalives for all tcp connections (#1448) 2022-03-31 20:44:57 +03:00
Anastasia Lubennikova
8745b022a9 Extend the LayerMap dump() function to also print open_layers and frozen_layers.
Add a verbose option to choose whether to print all of a layer's keys.
2022-03-31 17:26:24 +03:00
Arthur Petukhovsky
a40b7cd516 Fix timeouts in test_restarts_under_load (#1436)
* Enable backpressure in test_restarts_under_load

* Remove hacks because #644 is fixed now

* Adjust config in test_restarts_under_load
2022-03-31 17:00:09 +03:00
Konstantin Knizhnik
1aa8fe43cf Fix race condition in image layer (#1440)
* Fix race condition in image layer

refer #1439

* Add explicit drop(inner) in layer load method
2022-03-31 15:47:59 +03:00
Dmitry Rodionov
649f324fe3 make logging in basebackup more consistent 2022-03-30 17:58:51 +03:00
Dmitry Rodionov
8609234204 decrease the log level to debug because it is too noisy 2022-03-30 10:13:38 +03:00
Anton Shyrabokau
5c5629910f Add a test case for reading historic page versions (#1314)
* Add a test case for reading historic page versions

 Test that read_page_at_lsn returns correct results when compared to pageinspect.
 Validate the possibility of reading pages from a dropped relation.
 Ensure the functions read the latest version when a null LSN is supplied.
 Check that the functions do not poison the buffer cache with stale page versions.
2022-03-29 22:13:06 -07:00
Kirill Bulatov
277e41f4b7 Show s3 spans in logs and improve the log messages 2022-03-29 19:21:31 +03:00
Arthur Petukhovsky
ce0243bc12 Add metric for last_record_lsn (#1430) 2022-03-29 18:54:24 +03:00
Arseny Sher
ec3bc74165 Add safekeeper information exchange through etcd.
Safekeepers now publish per-timeline data to etcd and pull it from there. The
immediate goal is WAL truncation, for which every safekeeper must know
remote_consistent_lsn; the next would be a callmemaybe replacement.

Adds a corresponding '--broker' argument to the safekeeper and the ability to
run etcd in tests.

Adds a test checking that remote_consistent_lsn is indeed communicated.
2022-03-29 18:16:49 +04:00
Dmitry Rodionov
9594362f74 change python cache version to 2 (fixes python cache in circle CI) 2022-03-29 10:42:30 +03:00
Dmitry Rodionov
eee0f51e0c use cargo-hakari to manage workspace_hack crate
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features enabled. The problem occurs when you build crates separately,
one by one, so this is irrelevant to our CI setup, where we build
all binaries at once, but it may be relevant for local development.

This also changes cargo's resolver version to 2.
2022-03-29 10:42:04 +03:00
Arthur Petukhovsky
fd78110c2b Add default statement_timeout for tests (#1423) 2022-03-29 09:57:00 +03:00
Anton Shyrabokau
be6a6958e2 CI: rebuild postgres when Makefile changes (#1429) 2022-03-28 18:19:20 -07:00
Kirill Bulatov
0e44887929 Show more S3 logs and less verbose WAL logs 2022-03-29 00:36:06 +03:00
Dhammika Pathirana
1aa57fc262 Tone down compaction log chatter
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-03-28 13:24:13 -07:00
Alexey Kondratov
9a4f0930c0 Turn off S3 for pageserver on staging 2022-03-28 14:14:17 -05:00
Alexey Kondratov
d88f8b4a7e Fix storage deploy condition in ansible playbook 2022-03-28 13:30:40 -05:00
Arthur Petukhovsky
8a901de52a Refactor control file updates in the safekeeper.
Record global_commit_lsn, add a common routine for control file updates, and add
SafekeeperMemstate.
2022-03-28 21:52:12 +04:00
Alexey Kondratov
a883202495 Enable S3 for pageserver on staging
Follow-up for #1417. Previously we had a problem uploading to S3
due to the huge amount of existing, not-yet-uploaded data. Now we have a
fresh pageserver with LSM storage on staging, so we can try enabling it
once again.
2022-03-28 12:04:40 -05:00
Arseny Sher
780b46ad27 Bump vendor/postgres to fix commit_lsn going backwards. 2022-03-28 20:37:33 +04:00
Arseny Sher
75002adc14 Make shared_buffers large in test_pageserver_catchup.
We intentionally write while pageserver is down, so we shouldn't query it.

Noticed by @petuhovskiy at
https://github.com/zenithdb/postgres/pull/141#issuecomment-1080261700
2022-03-28 20:34:06 +04:00
Heikki Linnakangas
07342f7519 Major storage format rewrite.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.

Simplify Repository to a value-store
------------------------------------

Move the responsibility of tracking relation metadata, like which
relations exist and what are their sizes, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a
simple key-value PUT/GET operations.

It's still not any old key-value store though. A Repository is still
responsible from handling branching, and every GET operation comes
with an LSN.

Mapping from Postgres data directory to keys/values
---------------------------------------------------

All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.

The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.

'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.

The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.

Changes to low-level layer file code
------------------------------------

The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.

Checkpointing, compaction
-------------------------

The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.

Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.

Deployment
----------

This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, at the same time
with a new pageserver with the new binary.

Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
2022-03-28 05:41:15 -05:00
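To make the key-value mapping concrete, here is an illustrative sketch of the kind of key described above: wide enough for a full RelFileNode, fork and block number, with room to distinguish metadata keys. Field names and widths are assumptions, not the actual Key layout in pgdatadir_mapping.rs.

```rust
/// Wide enough for a full RelFileNode (spc/db/rel OIDs), fork and block
/// number, plus a leading field to distinguish metadata keys. Field names
/// and widths are assumptions for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Key {
    kind: u8, // e.g. 0 = relation data, other values = metadata
    spcnode: u32,
    dbnode: u32,
    relnode: u32,
    forknum: u8,
    blknum: u32,
}

/// Map a PostgreSQL relation page to its key in the value store.
fn rel_block_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: 0, spcnode, dbnode, relnode, forknum, blknum }
}
```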
Kirill Bulatov
55de0b88f5 Hide remote timeline index access details 2022-03-28 12:36:01 +03:00
Kirill Bulatov
d56a0ee19a Avoid recompiling tests for release profile 2022-03-26 08:38:45 +02:00
Kirill Bulatov
18dfc769d8 Use cachepot to cache more rustc builds 2022-03-26 08:38:45 +02:00
Heikki Linnakangas
5e04dad360 Add more variants of the sequential scan performance tests.
More rows, and test with serial and parallel plans. But fewer iterations,
so that the tests run in < 1 minute, and we don't need to mark them as
"slow".
2022-03-25 23:42:13 +02:00
Dmitry Rodionov
b8cba059a5 temporarily disable s3 integration on staging until the LSM storage rewrite lands 2022-03-26 00:19:25 +04:00
Heikki Linnakangas
e3fa00972e Use RwLocks in image and delta layers for more concurrency.
With a Mutex, only one thread could read from the layer at a time. I did
some ad hoc profiling with pgbench and saw that a fair amount of time was
spent blocked on these Mutexes.
2022-03-25 15:34:38 +02:00
Kirill Bulatov
b39d1b1717 Exit only on important thread failures 2022-03-25 11:58:54 +02:00
Kirill Bulatov
28bc8e3f5c Log pageserver threads better and shut down on errors in them 2022-03-25 11:58:54 +02:00
Kirill Bulatov
6244fd9e7e Better error messages on zenith cli subcommand invocations 2022-03-25 11:58:54 +02:00
Kirill Bulatov
f6b1d76c30 Replace assert! with ensure! for anyhow::Result functions 2022-03-25 11:58:54 +02:00
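The pattern behind this commit: in functions that return anyhow::Result, anyhow's ensure! macro turns a violated invariant into an Err instead of a panic. A minimal example (the function and message are made up for illustration):

```rust
use anyhow::{ensure, Result};

/// Before: `assert!(start <= end)` would panic and take down the thread.
/// After: `ensure!` returns an error the caller can handle or report.
fn check_lsn_range(start: u64, end: u64) -> Result<()> {
    ensure!(start <= end, "invalid LSN range: {:#x}..{:#x}", start, end);
    Ok(())
}
```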
Kirill Bulatov
edc7bebcb5 Remove obvious panic sources 2022-03-25 11:58:54 +02:00
Kirill Bulatov
a201d33edc Properly print cachepot stats 2022-03-24 21:11:02 +02:00
Heikki Linnakangas
825d363170 Remove some unnecessary Ord etc. trait implementations.
It doesn't make much sense to compare TimelineMetadata structs with
< or >. But we depended on that in the remote storage upload code,
so replace BTreeSets with Vecs there.
2022-03-24 12:20:06 +02:00
Dmitry Rodionov
b9a1a75b0d clean up unused imports in python tests 2022-03-24 12:47:22 +04:00
Dmitry Rodionov
d3a9cb44a6 tweak timeouts for tenant relocation test 2022-03-24 12:47:22 +04:00
Heikki Linnakangas
c718870517 Tiny refactoring of page_cache::init function.
The init function only needs 'page_cache_size' from the config, so it
seems slightly nicer to pass just that.
2022-03-24 09:46:07 +02:00
Dmitry Rodionov
8437fc056e Some follow-ups after S3 integration was enabled on staging:
* do not error out when the upload file list is empty
* ignore ephemeral files during sync initialization
2022-03-23 23:35:36 +04:00
Dmitry Rodionov
8b8d78a3a0 use main branch of our bookfile crate 2022-03-23 22:05:43 +04:00
Dmitry Rodionov
8a86276a6e add more context to error 2022-03-23 18:38:15 +04:00
Dmitry Rodionov
0be7ed0cb5 decrease log message severity for timeline checkpoint internals 2022-03-23 18:20:43 +04:00
Dmitry Rodionov
e80ae4306a change log level from info to debug for timeline gc messages 2022-03-23 18:20:43 +04:00
Heikki Linnakangas
123fcd5d0d Revert accidental bump of vendor/postgres submodule
I accidentally bumped it in commit 3b069f5aef. It didn't seem to cause
any harm, but it was not intentional.
2022-03-23 15:45:29 +02:00
Kirill Bulatov
15434ba7e0 Show cachepot build stats 2022-03-23 14:12:59 +02:00
Andrey Taranik
a4d0d78e9e s3 settings for pageserver (#1388) 2022-03-23 13:39:55 +03:00
Dmitry Rodionov
e13bdd77fe Add safekeeper gossip and storage messaging RFCs
They were in PRs during the RFC repo import.

In addition to just importing them, I've added sequence diagrams to the storage
messaging RFC.
2022-03-22 15:01:26 +04:00
Kirill Bulatov
bd6bef468c Provide a single 'list timelines' HTTP API handler 2022-03-21 13:42:21 +02:00
Kirill Bulatov
77ed2a0fa0 Run GitHub testing workflow on every push 2022-03-21 12:46:33 +02:00
Kirill Bulatov
37ebbb598d Add a macOS build 2022-03-21 12:46:33 +02:00
247 changed files with 20938 additions and 11685 deletions

.circleci/ansible/.gitignore vendored Normal file
View File

@@ -0,0 +1,2 @@
zenith_install.tar.gz
.zenith_current_version

View File

@@ -1,67 +1,29 @@
- name: Upload Zenith binaries
hosts: pageservers:safekeepers
- name: Upload Neon binaries
hosts: storage
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: get latest version of Zenith binaries
ignore_errors: true
- name: get latest version of Neon binaries
register: current_version_file
set_fact:
current_version: "{{ lookup('file', '.zenith_current_version') | trim }}"
tags:
- pageserver
- safekeeper
- name: set zero value for current_version
when: current_version_file is failed
set_fact:
current_version: "0"
tags:
- pageserver
- safekeeper
- name: get deployed version from content of remote file
ignore_errors: true
ansible.builtin.slurp:
src: /usr/local/.zenith_current_version
register: remote_version_file
tags:
- pageserver
- safekeeper
- name: decode remote file content
when: remote_version_file is succeeded
set_fact:
remote_version: "{{ remote_version_file['content'] | b64decode | trim }}"
tags:
- pageserver
- safekeeper
- name: set zero value for remote_version
when: remote_version_file is failed
set_fact:
remote_version: "0"
current_version: "{{ lookup('file', '.neon_current_version') | trim }}"
tags:
- pageserver
- safekeeper
- name: inform about versions
debug: msg="Version to deploy - {{ current_version }}, version on storage node - {{ remote_version }}"
debug: msg="Version to deploy - {{ current_version }}"
tags:
- pageserver
- safekeeper
- name: upload and extract Zenith binaries to /usr/local
when: current_version > remote_version or force_deploy
- name: upload and extract Neon binaries to /usr/local
ansible.builtin.unarchive:
owner: root
group: root
src: zenith_install.tar.gz
src: neon_install.tar.gz
dest: /usr/local
become: true
tags:
@@ -74,14 +36,24 @@
hosts: pageservers
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: upload init script
when: console_mgmt_base_url is defined
ansible.builtin.template:
src: scripts/init_pageserver.sh
dest: /tmp/init_pageserver.sh
owner: root
group: root
mode: '0755'
become: true
tags:
- pageserver
- name: init pageserver
when: current_version > remote_version or force_deploy
shell:
cmd: sudo -u pageserver /usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" --init -D /storage/pageserver/data
cmd: /tmp/init_pageserver.sh
args:
creates: "/storage/pageserver/data/tenants"
environment:
@@ -91,8 +63,20 @@
tags:
- pageserver
- name: update remote storage (s3) config
lineinfile:
path: /storage/pageserver/data/pageserver.toml
line: "{{ item }}"
loop:
- "[remote_storage]"
- "bucket_name = '{{ bucket_name }}'"
- "bucket_region = '{{ bucket_region }}'"
- "prefix_in_bucket = '{{ inventory_hostname }}'"
become: true
tags:
- pageserver
- name: upload systemd service definition
when: current_version > remote_version or force_deploy
ansible.builtin.template:
src: systemd/pageserver.service
dest: /etc/systemd/system/pageserver.service
@@ -104,7 +88,6 @@
- pageserver
- name: start systemd service
when: current_version > remote_version or force_deploy
ansible.builtin.systemd:
daemon_reload: yes
name: pageserver
@@ -115,7 +98,7 @@
- pageserver
- name: post version to console
when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined
when: console_mgmt_base_url is defined
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
@@ -127,22 +110,42 @@
hosts: safekeepers
gather_facts: False
remote_user: admin
vars:
force_deploy: false
tasks:
- name: upload init script
when: console_mgmt_base_url is defined
ansible.builtin.template:
src: scripts/init_safekeeper.sh
dest: /tmp/init_safekeeper.sh
owner: root
group: root
mode: '0755'
become: true
tags:
- safekeeper
- name: init safekeeper
shell:
cmd: /tmp/init_safekeeper.sh
args:
creates: "/storage/safekeeper/data/safekeeper.id"
environment:
ZENITH_REPO_DIR: "/storage/safekeeper/data"
LD_LIBRARY_PATH: "/usr/local/lib"
become: true
tags:
- safekeeper
# in the future safekeepers should discover pageservers byself
# but currently use first pageserver that was discovered
- name: set first pageserver var for safekeepers
when: current_version > remote_version or force_deploy
set_fact:
first_pageserver: "{{ hostvars[groups['pageservers'][0]]['inventory_hostname'] }}"
tags:
- safekeeper
- name: upload systemd service definition
when: current_version > remote_version or force_deploy
ansible.builtin.template:
src: systemd/safekeeper.service
dest: /etc/systemd/system/safekeeper.service
@@ -154,7 +157,6 @@
- safekeeper
- name: start systemd service
when: current_version > remote_version or force_deploy
ansible.builtin.systemd:
daemon_reload: yes
name: safekeeper
@@ -165,7 +167,7 @@
- safekeeper
- name: post version to console
when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined
when: console_mgmt_base_url is defined
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

View File

@@ -4,10 +4,10 @@ set -e
RELEASE=${RELEASE:-false}
# look at docker hub for latest tag fo zenith docker image
# look at docker hub for latest tag for neon docker image
if [ "${RELEASE}" = "true" ]; then
echo "search latest relase tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1)
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
@@ -16,7 +16,7 @@ if [ "${RELEASE}" = "true" ]; then
fi
else
echo "search latest dev tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep -v release | tail -1)
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -v release | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
@@ -28,25 +28,25 @@ fi
echo "found ${VERSION}"
# do initial cleanup
rm -rf zenith_install postgres_install.tar.gz zenith_install.tar.gz .zenith_current_version
mkdir zenith_install
rm -rf neon_install postgres_install.tar.gz neon_install.tar.gz .neon_current_version
mkdir neon_install
# retrive binaries from docker image
echo "getting binaries from docker image"
docker pull --quiet zenithdb/zenith:${TAG}
ID=$(docker create zenithdb/zenith:${TAG})
docker pull --quiet neondatabase/neon:${TAG}
ID=$(docker create neondatabase/neon:${TAG})
docker cp ${ID}:/data/postgres_install.tar.gz .
tar -xzf postgres_install.tar.gz -C zenith_install
docker cp ${ID}:/usr/local/bin/pageserver zenith_install/bin/
docker cp ${ID}:/usr/local/bin/safekeeper zenith_install/bin/
docker cp ${ID}:/usr/local/bin/proxy zenith_install/bin/
docker cp ${ID}:/usr/local/bin/postgres zenith_install/bin/
tar -xzf postgres_install.tar.gz -C neon_install
docker cp ${ID}:/usr/local/bin/pageserver neon_install/bin/
docker cp ${ID}:/usr/local/bin/safekeeper neon_install/bin/
docker cp ${ID}:/usr/local/bin/proxy neon_install/bin/
docker cp ${ID}:/usr/local/bin/postgres neon_install/bin/
docker rm -vf ${ID}
# store version to file (for ansible playbooks) and create binaries tarball
echo ${VERSION} > zenith_install/.zenith_current_version
echo ${VERSION} > .zenith_current_version
tar -czf zenith_install.tar.gz -C zenith_install .
echo ${VERSION} > neon_install/.neon_current_version
echo ${VERSION} > .neon_current_version
tar -czf neon_install.tar.gz -C neon_install .
# do final cleaup
rm -rf zenith_install postgres_install.tar.gz
rm -rf neon_install postgres_install.tar.gz

View File

@@ -1,7 +1,17 @@
[pageservers]
zenith-1-ps-1
zenith-1-ps-1 console_region_id=1
[safekeepers]
zenith-1-sk-1
zenith-1-sk-2
zenith-1-sk-3
zenith-1-sk-1 console_region_id=1
zenith-1-sk-2 console_region_id=1
zenith-1-sk-3 console_region_id=1
[storage:children]
pageservers
safekeepers
[storage:vars]
console_mgmt_base_url = http://console-release.local
bucket_name = zenith-storage-oregon
bucket_region = us-west-2
etcd_endpoints = etcd-release.local:2379

View File

@@ -0,0 +1,30 @@
#!/bin/sh
# get instance id from meta-data service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# store fqdn hostname in var
HOST=$(hostname -f)
cat <<EOF | tee /tmp/payload
{
"version": 1,
"host": "${HOST}",
"port": 6400,
"region_id": {{ console_region_id }},
"instance_id": "${INSTANCE_ID}",
"http_host": "${HOST}",
"http_port": 9898
}
EOF
# check if pageserver already registered or not
if ! curl -sf -X PATCH -d '{}' {{ console_mgmt_base_url }}/api/v1/pageservers/${INSTANCE_ID} -o /dev/null; then
# not registered, so register it now
ID=$(curl -sf -X POST {{ console_mgmt_base_url }}/api/v1/pageservers -d@/tmp/payload | jq -r '.ID')
# init pageserver
sudo -u pageserver /usr/local/bin/pageserver -c "id=${ID}" -c "pg_distrib_dir='/usr/local'" --init -D /storage/pageserver/data
fi

View File

@@ -0,0 +1,30 @@
#!/bin/sh
# get instance id from meta-data service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# store fqdn hostname in var
HOST=$(hostname -f)
cat <<EOF | tee /tmp/payload
{
"version": 1,
"host": "${HOST}",
"port": 6500,
"region_id": {{ console_region_id }},
"instance_id": "${INSTANCE_ID}",
"http_host": "${HOST}",
"http_port": 7676
}
EOF
# check if safekeeper already registered or not
if ! curl -sf -X PATCH -d '{}' {{ console_mgmt_base_url }}/api/v1/safekeepers/${INSTANCE_ID} -o /dev/null; then
# not registered, so register it now
ID=$(curl -sf -X POST {{ console_mgmt_base_url }}/api/v1/safekeepers -d@/tmp/payload | jq -r '.ID')
# init safekeeper
sudo -u safekeeper /usr/local/bin/safekeeper --id ${ID} --init -D /storage/safekeeper/data
fi

View File

@@ -1,7 +1,18 @@
[pageservers]
zenith-us-stage-ps-1
#zenith-us-stage-ps-1 console_region_id=27
zenith-us-stage-ps-2 console_region_id=27
[safekeepers]
zenith-us-stage-sk-1
zenith-us-stage-sk-2
zenith-us-stage-sk-3
zenith-us-stage-sk-1 console_region_id=27
zenith-us-stage-sk-2 console_region_id=27
zenith-us-stage-sk-4 console_region_id=27
[storage:children]
pageservers
safekeepers
[storage:vars]
console_mgmt_base_url = http://console-staging.local
bucket_name = zenith-staging-storage-us-east-1
bucket_region = us-east-1
etcd_endpoints = etcd-staging.local:2379

View File

@@ -6,7 +6,7 @@ After=network.target auditd.service
Type=simple
User=safekeeper
Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/safekeeper/data LD_LIBRARY_PATH=/usr/local/lib
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }}
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT

View File

@@ -1,18 +1,18 @@
version: 2.1
executors:
zenith-xlarge-executor:
neon-xlarge-executor:
resource_class: xlarge
docker:
# NB: when changed, do not forget to update rust image tag in all Dockerfiles
- image: zimg/rust:1.56
zenith-executor:
- image: zimg/rust:1.58
neon-executor:
docker:
- image: zimg/rust:1.56
- image: zimg/rust:1.58
jobs:
check-codestyle-rust:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
steps:
- checkout
- run:
@@ -22,7 +22,7 @@ jobs:
# A job to build postgres
build-postgres:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
parameters:
build_type:
type: enum
@@ -34,10 +34,13 @@ jobs:
- checkout
# Grab the postgres git revision to build a cache key.
# Append makefile as it could change the way postgres is built.
# Note this works even though the submodule hasn't been checkout out yet.
- run:
name: Get postgres cache key
command: git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
command: |
git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
cat Makefile >> /tmp/cache-key-postgres
- restore_cache:
name: Restore postgres cache
@@ -64,9 +67,9 @@ jobs:
paths:
- tmp_install
# A job to build zenith rust code
build-zenith:
executor: zenith-xlarge-executor
# A job to build Neon rust code
build-neon:
executor: neon-xlarge-executor
parameters:
build_type:
type: enum
@@ -78,11 +81,14 @@ jobs:
- checkout
# Grab the postgres git revision to build a cache key.
# Append makefile as it could change the way postgres is built.
# Note this works even though the submodule hasn't been checkout out yet.
- run:
name: Get postgres cache key
command: |
git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres
cat Makefile >> /tmp/cache-key-postgres
- restore_cache:
name: Restore postgres cache
@@ -107,11 +113,16 @@ jobs:
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
CARGO_FLAGS=--release
CARGO_FLAGS="--release --features profiling"
fi
export CARGO_INCREMENTAL=0
export CACHEPOT_BUCKET=zenith-rust-cachepot
export RUSTC_WRAPPER=cachepot
export AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}"
"${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests
cachepot -s
- save_cache:
name: Save rust cache
@@ -121,31 +132,19 @@ jobs:
- ~/.cargo/git
- target
# Run style checks
# has to run separately from cargo fmt section
# since needs to run with dependencies
- run:
name: cargo clippy
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" ./run_clippy.sh
# Run rust unit tests
- run:
name: cargo test
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
CARGO_FLAGS=--release
fi
"${cov_prefix[@]}" cargo test
"${cov_prefix[@]}" cargo test $CARGO_FLAGS
# Install the rust binaries, for use by test jobs
- run:
@@ -210,17 +209,17 @@ jobs:
- "*"
check-codestyle-python:
executor: zenith-executor
executor: neon-executor
steps:
- checkout
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- v2-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
@@ -233,7 +232,7 @@ jobs:
command: poetry run mypy .
run-pytest:
executor: zenith-executor
executor: neon-executor
parameters:
# pytest args to specify the tests to run.
#
@@ -274,12 +273,12 @@ jobs:
- run: git submodule update --init --depth 1
- restore_cache:
keys:
- v1-python-deps-{{ checksum "poetry.lock" }}
- v2-python-deps-{{ checksum "poetry.lock" }}
- run:
name: Install deps
command: ./scripts/pysync
- save_cache:
key: v1-python-deps-{{ checksum "poetry.lock" }}
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
@@ -356,7 +355,7 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" -delete
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
@@ -377,7 +376,7 @@ jobs:
- "*"
coverage-report:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
steps:
- attach_workspace:
at: /tmp/zenith
@@ -392,7 +391,7 @@ jobs:
- run:
name: Build coverage report
command: |
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1
scripts/coverage \
--dir=/tmp/zenith/coverage report \
@@ -403,11 +402,11 @@ jobs:
name: Upload coverage report
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
REPORT_URL=https://zenithdb.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
REPORT_URL=https://neondatabase.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1
scripts/git-upload \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/neondatabase/zenith-coverage-data.git \
--message="Add code coverage for $COMMIT_URL" \
copy /tmp/zenith/coverage/report $CIRCLE_SHA1 # COPY FROM TO_RELATIVE
@@ -424,7 +423,7 @@ jobs:
\"target_url\": \"$REPORT_URL\"
}"
# Build zenithdb/zenith:latest image and push it to Docker hub
# Build neondatabase/neon:latest image and push it to Docker hub
docker-image:
docker:
- image: cimg/base:2021.04
@@ -438,18 +437,18 @@ jobs:
- run:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build \
--pull \
--build-arg GIT_VERSION=${CIRCLE_SHA1} \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:latest .
docker push zenithdb/zenith:${DOCKER_TAG}
docker push zenithdb/zenith:latest
--tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:latest .
docker push neondatabase/neon:${DOCKER_TAG}
docker push neondatabase/neon:latest
# Build zenithdb/compute-node:latest image and push it to Docker hub
# Build neondatabase/compute-node:latest image and push it to Docker hub
docker-image-compute:
docker:
- image: cimg/base:2021.04
@@ -457,28 +456,31 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build zenithdb/compute-tools:latest image and push it to Docker hub
# Build neondatabase/compute-tools:latest image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/compute-tools:latest -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:latest
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag neondatabase/compute-tools:latest -f Dockerfile.compute-tools .
docker push neondatabase/compute-tools:latest
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
- run:
name: Build and push compute-node Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:latest vendor/postgres
docker push zenithdb/compute-node:${DOCKER_TAG}
docker push zenithdb/compute-node:latest
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:latest vendor/postgres
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:latest
# Build production zenithdb/zenith:release image and push it to Docker hub
# Build production neondatabase/neon:release image and push it to Docker hub
docker-image-release:
docker:
- image: cimg/base:2021.04
@@ -492,18 +494,18 @@ jobs:
- run:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
docker build \
--pull \
--build-arg GIT_VERSION=${CIRCLE_SHA1} \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:release .
docker push zenithdb/zenith:${DOCKER_TAG}
docker push zenithdb/zenith:release
--tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:release .
docker push neondatabase/neon:${DOCKER_TAG}
docker push neondatabase/neon:release
# Build production zenithdb/compute-node:release image and push it to Docker hub
# Build production neondatabase/compute-node:release image and push it to Docker hub
docker-image-compute-release:
docker:
- image: cimg/base:2021.04
@@ -511,26 +513,29 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build zenithdb/compute-tools:release image and push it to Docker hub
# Build neondatabase/compute-tools:release image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/compute-tools:release -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:release
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag neondatabase/compute-tools:release -f Dockerfile.compute-tools .
docker push neondatabase/compute-tools:release
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
- run:
name: Build and push compute-node Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:release vendor/postgres
docker push zenithdb/compute-node:${DOCKER_TAG}
docker push zenithdb/compute-node:release
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:release vendor/postgres
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:release
deploy-staging:
docker:
@@ -556,7 +561,7 @@ jobs:
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i staging.hosts
rm -f zenith_install.tar.gz .zenith_current_version
rm -f neon_install.tar.gz .neon_current_version
deploy-staging-proxy:
docker:
@@ -574,7 +579,7 @@ jobs:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://zenithdb.github.io/helm-charts
helm repo add zenithdb https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
@@ -605,8 +610,8 @@ jobs:
ssh-add ssh-key
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i production.hosts -e console_mgmt_base_url=http://console-release.local
rm -f zenith_install.tar.gz .zenith_current_version
ansible-playbook deploy.yaml -i production.hosts
rm -f neon_install.tar.gz .neon_current_version
deploy-release-proxy:
docker:
@@ -624,7 +629,7 @@ jobs:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://zenithdb.github.io/helm-charts
helm repo add zenithdb https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
@@ -653,7 +658,7 @@ jobs:
--data \
"{
\"state\": \"pending\",
\"context\": \"zenith-remote-ci\",
\"context\": \"neon-cloud-e2e\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
- run:
@@ -669,7 +674,7 @@ jobs:
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"zenith-remote-ci\",
\"ci_job_name\": \"neon-cloud-e2e\",
\"commit_hash\": \"$CIRCLE_SHA1\",
\"remote_repo\": \"$LOCAL_REPO\"
}
@@ -685,8 +690,8 @@ workflows:
matrix:
parameters:
build_type: ["debug", "release"]
- build-zenith:
name: build-zenith-<< matrix.build_type >>
- build-neon:
name: build-neon-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
@@ -701,7 +706,7 @@ workflows:
test_selection: batch_pg_regress
needs_postgres_source: true
requires:
- build-zenith-<< matrix.build_type >>
- build-neon-<< matrix.build_type >>
- run-pytest:
name: other-tests-<< matrix.build_type >>
matrix:
@@ -709,7 +714,7 @@ workflows:
build_type: ["debug", "release"]
test_selection: batch_others
requires:
- build-zenith-<< matrix.build_type >>
- build-neon-<< matrix.build_type >>
- run-pytest:
name: benchmarks
context: PERF_TEST_RESULT_CONNSTR
@@ -718,7 +723,7 @@ workflows:
run_in_parallel: false
save_perf_report: true
requires:
- build-zenith-release
- build-neon-release
- coverage-report:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
@@ -809,11 +814,11 @@ workflows:
- remote-ci-trigger:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
remote_repo: "zenithdb/console"
remote_repo: "neondatabase/cloud"
requires:
# XXX: Successful build doesn't mean everything is OK, but
# the job to be triggered takes so much time to complete (~22 min)
# that it's better not to wait for the commented-out steps
- build-zenith-debug
- build-neon-release
# - pg_regress-tests-release
# - other-tests-release


@@ -1,9 +1,12 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.zenith.tech/authenticate_proxy_request/"
uri: "https://console.zenith.tech/psql_session/"
authEndpoint: "https://console.neon.tech/authenticate_proxy_request/"
uri: "https://console.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -25,7 +28,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.zenith.tech
external-dns.alpha.kubernetes.io/hostname: start.zenith.tech,connect.neon.tech,pg.neon.tech
metrics:
enabled: true


@@ -1,9 +1,12 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/"
uri: "https://console.stage.zenith.tech/psql_session/"
authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/"
uri: "https://console.stage.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -17,7 +20,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.stage.zenith.tech
external-dns.alpha.kubernetes.io/hostname: connect.stage.neon.tech
metrics:
enabled: true

.config/hakari.toml (new file, 26 lines)

@@ -0,0 +1,26 @@
# This file contains settings for `cargo hakari`.
# See https://docs.rs/cargo-hakari/latest/cargo_hakari/config for a full list of options.
hakari-package = "workspace_hack"
# Format for `workspace-hack = ...` lines in other Cargo.tomls. Requires cargo-hakari 0.9.8 or above.
dep-format-version = "2"
# Setting workspace.resolver = "2" in the root Cargo.toml is HIGHLY recommended.
# Hakari works much better with the new feature resolver.
# For more about the new feature resolver, see:
# https://blog.rust-lang.org/2021/03/25/Rust-1.51.0.html#cargos-new-feature-resolver
# Have to keep the resolver here since hakari requires this field,
# even though it's now the default for the 2021 edition & cargo.
resolver = "2"
# Add triples corresponding to platforms commonly used by developers here.
# https://doc.rust-lang.org/rustc/platform-support.html
platforms = [
# "x86_64-unknown-linux-gnu",
# "x86_64-apple-darwin",
# "x86_64-pc-windows-msvc",
]
# Write out exact versions rather than a semver range. (Defaults to false.)
# exact-versions = true


@@ -26,7 +26,7 @@ jobs:
runs-on: [self-hosted, zenith-benchmarker]
env:
PG_BIN: "/usr/pgsql-13/bin"
POSTGRES_DISTRIB_DIR: "/usr/pgsql-13"
steps:
- name: Checkout zenith repo
@@ -51,7 +51,7 @@ jobs:
echo Poetry
poetry --version
echo Pgbench
$PG_BIN/pgbench --version
$POSTGRES_DISTRIB_DIR/bin/pgbench --version
# FIXME cluster setup is skipped due to various changes in console API
# for now pre created cluster is used. When API gain some stability
@@ -66,7 +66,7 @@ jobs:
echo "Starting cluster"
# wake up the cluster
$PG_BIN/psql $BENCHMARK_CONNSTR -c "SELECT 1"
$POSTGRES_DISTRIB_DIR/bin/psql $BENCHMARK_CONNSTR -c "SELECT 1"
- name: Run benchmark
# pgbench is installed system wide from official repo
@@ -83,8 +83,11 @@ jobs:
# sudo yum install postgresql13-contrib
# actual binaries are located in /usr/pgsql-13/bin/
env:
TEST_PG_BENCH_TRANSACTIONS_MATRIX: "5000,10000,20000"
TEST_PG_BENCH_SCALES_MATRIX: "10,15"
# The pgbench test runs two tests of given duration against each scale.
# So the total runtime with these parameters is 2 * 2 * 300 = 1200, or 20 minutes.
# Plus time needed to initialize the test databases.
TEST_PG_BENCH_DURATIONS_MATRIX: "300"
TEST_PG_BENCH_SCALES_MATRIX: "10,100"
PLATFORM: "zenith-staging"
BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
REMOTE_ENV: "1" # indicate to test harness that we do not have zenith binaries locally


@@ -1,10 +1,6 @@
name: Build and Test
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
on: push
jobs:
regression-check:
@@ -13,7 +9,7 @@ jobs:
# If we want to duplicate this job for different
# Rust toolchains (e.g. nightly or 1.37.0), add them here.
rust_toolchain: [stable]
os: [ubuntu-latest]
os: [ubuntu-latest, macos-latest]
timeout-minutes: 30
name: run regression test suite
runs-on: ${{ matrix.os }}
@@ -32,11 +28,16 @@ jobs:
toolchain: ${{ matrix.rust_toolchain }}
override: true
- name: Install postgres dependencies
- name: Install Ubuntu postgres dependencies
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt update
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev
- name: Install macOs postgres dependencies
if: matrix.os == 'macos-latest'
run: brew install flex bison
- name: Set pg revision for caching
id: pg_ver
run: echo ::set-output name=pg_rev::$(git rev-parse HEAD:vendor/postgres)
@@ -51,8 +52,7 @@ jobs:
- name: Build postgres
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make postgres
run: make postgres
- name: Cache cargo deps
id: cache_cargo
@@ -62,13 +62,10 @@ jobs:
~/.cargo/registry
~/.cargo/git
target
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}
# Use `env CARGO_INCREMENTAL=0` to mitigate https://github.com/rust-lang/rust/issues/91696 for rustc 1.57.0
- name: Run cargo build
run: |
env CARGO_INCREMENTAL=0 cargo build --workspace --bins --examples --tests
- name: Run cargo clippy
run: ./run_clippy.sh
- name: Run cargo test
run: |
env CARGO_INCREMENTAL=0 cargo test -- --nocapture --test-threads=1
run: cargo test --all --all-targets

Cargo.lock (generated): diff suppressed because it is too large (1535 changed lines).


@@ -3,13 +3,11 @@ members = [
"compute_tools",
"control_plane",
"pageserver",
"postgres_ffi",
"proxy",
"walkeeper",
"safekeeper",
"workspace_hack",
"zenith",
"zenith_metrics",
"zenith_utils",
"libs/*",
]
[profile.release]
@@ -17,7 +15,7 @@ members = [
# Besides, debug info should not affect the performance.
debug = true
# This is only needed for proxy's tests
# TODO: we should probably fork tokio-postgres-rustls instead
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }


@@ -1,7 +1,5 @@
# Build Postgres
#
#FROM zimg/rust:1.56 AS pg-build
FROM zenithdb/build:buster-20220309 AS pg-build
FROM zimg/rust:1.58 AS pg-build
WORKDIR /pg
USER root
@@ -11,26 +9,26 @@ COPY Makefile Makefile
ENV BUILD_TYPE release
RUN set -e \
&& make -j $(nproc) -s postgres \
&& mold -run make -j $(nproc) -s postgres \
&& rm -rf tmp_install/build \
&& tar -C tmp_install -czf /postgres_install.tar.gz .
# Build zenith binaries
#
#FROM zimg/rust:1.56 AS build
FROM zenithdb/build:buster-20220309 AS build
FROM zimg/rust:1.58 AS build
ARG GIT_VERSION=local
ARG CACHEPOT_BUCKET=zenith-rust-cachepot
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
#ENV RUSTC_WRAPPER cachepot
ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot
COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY . .
RUN cargo build --release
# Show build caching stats to check if it was used in the end.
# Has to be part of the same RUN, since the cachepot daemon is killed at the end of this RUN, losing the compilation stats.
RUN set -e \
&& sudo -E "PATH=$PATH" mold -run cargo build --release \
&& cachepot -s
# Build final image
#


@@ -1,23 +0,0 @@
FROM rust:1.56.1-slim-buster
WORKDIR /home/circleci/project
RUN set -e \
&& apt-get update \
&& apt-get -yq install \
automake \
libtool \
build-essential \
bison \
flex \
libreadline-dev \
zlib1g-dev \
libxml2-dev \
libseccomp-dev \
pkg-config \
libssl-dev \
clang
RUN set -e \
&& rustup component add clippy \
&& cargo install cargo-audit \
&& cargo install --git https://github.com/paritytech/cachepot


@@ -1,14 +1,18 @@
# First transient image to build compute_tools binaries
# NB: keep in sync with rust image version in .circle/config.yml
FROM rust:1.56.1-slim-buster AS rust-build
FROM zimg/rust:1.58 AS rust-build
WORKDIR /zenith
ARG CACHEPOT_BUCKET=zenith-rust-cachepot
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
COPY . .
RUN cargo build -p compute_tools --release
RUN set -e \
&& sudo -E "PATH=$PATH" mold -run cargo build -p compute_tools --release \
&& cachepot -s
# Final image that only has one binary
FROM debian:buster-slim
COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl
COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl


@@ -78,6 +78,11 @@ postgres: postgres-configure \
$(MAKE) -C tmp_install/build/contrib/zenith install
+@echo "Compiling contrib/zenith_test_utils"
$(MAKE) -C tmp_install/build/contrib/zenith_test_utils install
+@echo "Compiling pg_buffercache"
$(MAKE) -C tmp_install/build/contrib/pg_buffercache install
+@echo "Compiling pageinspect"
$(MAKE) -C tmp_install/build/contrib/pageinspect install
.PHONY: postgres-clean
postgres-clean:


@@ -1,19 +1,22 @@
# Zenith
# Neon
Zenith is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
Neon is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes.
The project used to be called "Zenith". Many of the commands and code comments
still refer to "zenith", but we are in the process of renaming things.
## Architecture overview
A Zenith installation consists of compute nodes and Zenith storage engine.
A Neon installation consists of compute nodes and Neon storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage engine.
Compute nodes are stateless PostgreSQL nodes, backed by Neon storage engine.
Zenith storage engine consists of two major components:
Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
- WAL service. The service that receives WAL from compute node and ensures that it is stored durably.
Pageserver consists of:
- Repository - Zenith storage implementation.
- Repository - Neon storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
@@ -28,17 +31,17 @@ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libsec
libssl-dev clang pkg-config libpq-dev
```
[Rust] 1.56.1 or later is also required.
[Rust] 1.58 or later is also required.
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build zenith and patched postgres
2. Build neon and patched postgres
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
cd zenith
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
@@ -126,7 +129,7 @@ INSERT 0 1
## Running tests
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
git clone --recursive https://github.com/neondatabase/neon.git
make # builds also postgres and installs it to ./tmp_install
./scripts/pytest
```
@@ -141,14 +144,14 @@ To view your `rustdoc` documentation in a browser, try running `cargo doc --no-d
### Postgres-specific terms
Due to Zenith's very close relation with PostgreSQL internals, there are numerous specific terms used.
Due to Neon's very close relation with PostgreSQL internals, there are numerous specific terms used.
The same applies to certain spelling: e.g., we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it is inconsistent with what PostgreSQL code and its documentation use.
To get more familiar with this aspect, refer to:
- [Zenith glossary](/docs/glossary.md)
- [Neon glossary](/docs/glossary.md)
- [PostgreSQL glossary](https://www.postgresql.org/docs/13/glossary.html)
- Other PostgreSQL documentation and sources (Zenith fork sources can be found [here](https://github.com/zenithdb/postgres))
- Other PostgreSQL documentation and sources (Neon fork sources can be found [here](https://github.com/neondatabase/postgres))
## Join the development


@@ -11,9 +11,11 @@ clap = "3.0"
env_logger = "0.9"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1", features = ["macros", "rt", "rt-multi-thread"] }
tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }


@@ -38,6 +38,7 @@ use clap::Arg;
use log::info;
use postgres::{Client, NoTls};
use compute_tools::checker::create_writablity_check_data;
use compute_tools::config;
use compute_tools::http_api::launch_http_server;
use compute_tools::logger::*;
@@ -128,6 +129,8 @@ fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
handle_grants(&read_state.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);
@@ -155,7 +158,7 @@ fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
}
fn main() -> Result<()> {
// TODO: re-use `zenith_utils::logging` later
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`


@@ -0,0 +1,46 @@
use std::sync::{Arc, RwLock};
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use crate::zenith::ComputeState;
pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
let query = "
CREATE TABLE IF NOT EXISTS health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();";
let result = client.simple_query(query)?;
if result.len() < 2 {
return Err(anyhow::format_err!("executed {} queries", result.len()));
}
Ok(())
}
pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let connstr = state.read().unwrap().connstr.clone();
let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?;
if client.is_closed() {
return Err(anyhow!("connection to postgres closed"));
}
tokio::spawn(async move {
if let Err(e) = connection.await {
error!("connection error: {}", e);
}
});
let result = client
.simple_query("UPDATE health_check SET updated_at = now() WHERE id = 1;")
.await?;
if result.len() != 1 {
return Err(anyhow!("statement can't be executed"));
}
Ok(())
}


@@ -11,7 +11,7 @@ use log::{error, info};
use crate::zenith::*;
// Service function to handle all available routes.
fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
match (req.method(), req.uri().path()) {
// Timestamp of the last Postgres activity in the plain text.
(&Method::GET, "/last_activity") => {
@@ -29,6 +29,15 @@ fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body
Response::new(Body::from(format!("{}", state.ready)))
}
(&Method::GET, "/check_writability") => {
info!("serving /check_writability GET request");
let res = crate::checker::check_writability(&state).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
}
}
// Return the `404 Not Found` for any other routes.
_ => {
let mut not_found = Response::new(Body::from("404 Not Found"));
@@ -48,7 +57,7 @@ async fn serve(state: Arc<RwLock<ComputeState>>) {
async move {
Ok::<_, Infallible>(service_fn(move |req: Request<Body>| {
let state = state.clone();
async move { Ok::<_, Infallible>(routes(req, state)) }
async move { Ok::<_, Infallible>(routes(req, state).await) }
}))
}
});
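For context, a hedged sketch of how the new `/check_writability` route might be exercised from a test or an external probe; the base URL and port below are assumptions for illustration, not taken from this diff:

```rust
/// Hits the compute node's HTTP API and reports whether the writability
/// check returned "true" (the handler above returns the error text otherwise).
fn probe_writability(base_url: &str) -> anyhow::Result<bool> {
    let body = reqwest::blocking::get(format!("{}/check_writability", base_url))?.text()?;
    Ok(body == "true")
}

fn main() -> anyhow::Result<()> {
    // Example only; the actual listen address depends on the compute node's configuration.
    println!("writable: {}", probe_writability("http://127.0.0.1:3080")?);
    Ok(())
}
```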


@@ -2,6 +2,7 @@
//! Various tools and helpers to handle cluster / compute node (Postgres)
//! configuration.
//!
pub mod checker;
pub mod config;
pub mod http_api;
#[macro_use]


@@ -132,7 +132,14 @@ impl Role {
let mut params: String = "LOGIN".to_string();
if let Some(pass) = &self.encrypted_password {
params.push_str(&format!(" PASSWORD 'md5{}'", pass));
// Some time ago we supported only md5 and treated all encrypted_password values as md5.
// Now we also support SCRAM-SHA-256, and to preserve compatibility
// we treat all encrypted_password values as md5 unless they start with SCRAM-SHA-256.
if pass.starts_with("SCRAM-SHA-256") {
params.push_str(&format!(" PASSWORD '{}'", pass));
} else {
params.push_str(&format!(" PASSWORD 'md5{}'", pass));
}
} else {
params.push_str(" PASSWORD NULL");
}
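The same branching, pulled out as a small standalone helper for illustration (a sketch only; the real code above appends to a `params` string in place):

```rust
/// Builds the PASSWORD clause appended to role options: SCRAM-SHA-256 verifiers
/// are passed through as-is, anything else is treated as an md5 hash for
/// backward compatibility, and a missing password becomes PASSWORD NULL.
fn password_clause(encrypted_password: Option<&str>) -> String {
    match encrypted_password {
        Some(pass) if pass.starts_with("SCRAM-SHA-256") => format!(" PASSWORD '{}'", pass),
        Some(pass) => format!(" PASSWORD 'md5{}'", pass),
        None => " PASSWORD NULL".to_string(),
    }
}
```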


@@ -244,3 +244,24 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
Ok(())
}
// Grant CREATE ON DATABASE to the database owner
// to allow clients create trusted extensions.
pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
info!("cluster spec grants:");
for db in &spec.cluster.databases {
let dbname = &db.name;
let query: String = format!(
"GRANT CREATE ON DATABASE {} TO {}",
dbname.quote(),
db.owner.quote()
);
info!("grant query {}", &query);
client.execute(query.as_str(), &[])?;
}
Ok(())
}
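Illustration only: the real code above also quotes identifiers via a `quote()` helper, which this sketch omits, and the database/owner names here are hypothetical rather than taken from a cluster spec:

```rust
/// Unquoted sketch of the statement built by handle_grants for one database.
fn grant_create_query(dbname: &str, owner: &str) -> String {
    format!("GRANT CREATE ON DATABASE {} TO {}", dbname, owner)
}

fn main() {
    assert_eq!(
        grant_create_query("neondb", "app_user"),
        "GRANT CREATE ON DATABASE neondb TO app_user"
    );
}
```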


@@ -5,7 +5,7 @@ edition = "2021"
[dependencies]
tar = "0.4.33"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
serde = { version = "1.0", features = ["derive"] }
serde_with = "1.12.0"
toml = "0.5"
@@ -18,6 +18,6 @@ url = "2.2.2"
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
pageserver = { path = "../pageserver" }
walkeeper = { path = "../walkeeper" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
safekeeper = { path = "../safekeeper" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }


@@ -11,11 +11,12 @@ use std::sync::Arc;
use std::time::Duration;
use anyhow::{Context, Result};
use zenith_utils::connstring::connection_host_port;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use utils::{
connstring::connection_host_port,
lsn::Lsn,
postgres_backend::AuthType,
zid::{ZTenantId, ZTimelineId},
};
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
@@ -272,12 +273,7 @@ impl PostgresNode {
conf.append("wal_sender_timeout", "5s");
conf.append("listen_addresses", &self.address.ip().to_string());
conf.append("port", &self.address.port().to_string());
// Never clean up old WAL. TODO: We should use a replication
// slot or something proper, to prevent the compute node
// from removing WAL that hasn't been streamed to the safekeeper or
// page server yet. (gh issue #349)
conf.append("wal_keep_size", "10TB");
conf.append("wal_keep_size", "0");
// Configure the node to fetch pages from pageserver
let pageserver_connstr = {
@@ -331,14 +327,14 @@ impl PostgresNode {
// Configure the node to connect to the safekeepers
conf.append("synchronous_standby_names", "walproposer");
let wal_acceptors = self
let safekeepers = self
.env
.safekeepers
.iter()
.map(|sk| format!("localhost:{}", sk.pg_port))
.collect::<Vec<String>>()
.join(",");
conf.append("wal_acceptors", &wal_acceptors);
conf.append("wal_acceptors", &safekeepers);
} else {
// We only use setup without safekeepers for tests,
// and don't care about data durability on pageserver,
@@ -420,10 +416,15 @@ impl PostgresNode {
if let Some(token) = auth_token {
cmd.env("ZENITH_AUTH_TOKEN", token);
}
let pg_ctl = cmd.status().context("pg_ctl failed")?;
if !pg_ctl.success() {
anyhow::bail!("pg_ctl failed");
let pg_ctl = cmd.output().context("pg_ctl failed")?;
if !pg_ctl.status.success() {
anyhow::bail!(
"pg_ctl failed, exit code: {}, stdout: {}, stderr: {}",
pg_ctl.status,
String::from_utf8_lossy(&pg_ctl.stdout),
String::from_utf8_lossy(&pg_ctl.stderr),
);
}
Ok(())
}


@@ -11,9 +11,11 @@ use std::env;
use std::fs;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use zenith_utils::auth::{encode_from_key_file, Claims, Scope};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId};
use utils::{
auth::{encode_from_key_file, Claims, Scope},
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId},
};
use crate::safekeeper::SafekeeperNode;
@@ -57,6 +59,10 @@ pub struct LocalEnv {
#[serde(default)]
pub private_key_path: PathBuf,
// A comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
pub broker_endpoints: Option<String>,
pub pageserver: PageServerConf,
#[serde(default)]


@@ -13,15 +13,17 @@ use nix::unistd::Pid;
use postgres::Config;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use safekeeper::http::models::TimelineCreateRequest;
use thiserror::Error;
use walkeeper::http::models::TimelineCreateRequest;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
use utils::{
connstring::connection_address,
http::error::HttpErrorBody,
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
use crate::local_env::{LocalEnv, SafekeeperConf};
use crate::storage::PageServerNode;
use crate::{fill_rust_env_vars, read_pidfile};
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
pub enum SafekeeperHttpError {
@@ -73,6 +75,8 @@ pub struct SafekeeperNode {
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
broker_endpoints: Option<String>,
}
impl SafekeeperNode {
@@ -89,6 +93,7 @@ impl SafekeeperNode {
http_client: Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
pageserver,
broker_endpoints: env.broker_endpoints.clone(),
}
}
@@ -135,6 +140,9 @@ impl SafekeeperNode {
if !self.conf.sync {
cmd.arg("--no-sync");
}
if let Some(ref ep) = self.broker_endpoints {
cmd.args(&["--broker-endpoints", ep]);
}
if !cmd.status()?.success() {
bail!(


@@ -1,3 +1,4 @@
use std::collections::HashMap;
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
@@ -9,21 +10,23 @@ use anyhow::{bail, Context};
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use pageserver::http::models::{TenantCreateRequest, TimelineCreateRequest};
use pageserver::http::models::{TenantConfigRequest, TenantCreateRequest, TimelineCreateRequest};
use pageserver::timelines::TimelineInfo;
use postgres::{Config, NoTls};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use utils::{
connstring::connection_address,
http::error::HttpErrorBody,
lsn::Lsn,
postgres_backend::AuthType,
zid::{ZTenantId, ZTimelineId},
};
use crate::local_env::LocalEnv;
use crate::{fill_rust_env_vars, read_pidfile};
use pageserver::tenant_mgr::TenantInfo;
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
pub enum PageserverHttpError {
@@ -148,12 +151,20 @@ impl PageServerNode {
let initial_timeline_id_string = initial_timeline_id.to_string();
args.extend(["--initial-timeline-id", &initial_timeline_id_string]);
let init_output = fill_rust_env_vars(cmd.args(args))
let cmd_with_args = cmd.args(args);
let init_output = fill_rust_env_vars(cmd_with_args)
.output()
.context("pageserver init failed")?;
.with_context(|| {
format!("failed to init pageserver with command {:?}", cmd_with_args)
})?;
if !init_output.status.success() {
bail!("pageserver init failed");
bail!(
"init invocation failed, {}\nStdout: {}\nStderr: {}",
init_output.status,
String::from_utf8_lossy(&init_output.stdout),
String::from_utf8_lossy(&init_output.stderr)
);
}
Ok(initial_timeline_id)
@@ -334,10 +345,36 @@ impl PageServerNode {
pub fn tenant_create(
&self,
new_tenant_id: Option<ZTenantId>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<Option<ZTenantId>> {
let tenant_id_string = self
.http_request(Method::POST, format!("{}/tenant", self.http_base_url))
.json(&TenantCreateRequest { new_tenant_id })
.json(&TenantCreateRequest {
new_tenant_id,
checkpoint_distance: settings
.get("checkpoint_distance")
.map(|x| x.parse::<u64>())
.transpose()?,
compaction_target_size: settings
.get("compaction_target_size")
.map(|x| x.parse::<u64>())
.transpose()?,
compaction_period: settings.get("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.get("compaction_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
gc_horizon: settings
.get("gc_horizon")
.map(|x| x.parse::<u64>())
.transpose()?,
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
.error_from_body()?
.json::<Option<String>>()?;
@@ -354,6 +391,35 @@ impl PageServerNode {
.transpose()
}
pub fn tenant_config(&self, tenant_id: ZTenantId, settings: HashMap<&str, &str>) -> Result<()> {
self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url))
.json(&TenantConfigRequest {
tenant_id,
checkpoint_distance: settings
.get("checkpoint_distance")
.map(|x| x.parse::<u64>().unwrap()),
compaction_target_size: settings
.get("compaction_target_size")
.map(|x| x.parse::<u64>().unwrap()),
compaction_period: settings.get("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.get("compaction_threshold")
.map(|x| x.parse::<usize>().unwrap()),
gc_horizon: settings
.get("gc_horizon")
.map(|x| x.parse::<u64>().unwrap()),
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>().unwrap()),
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
.error_from_body()?;
Ok(())
}
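Note that `tenant_config` above parses numeric settings with `unwrap()`, so a malformed value panics at runtime, while `tenant_create` uses `transpose()` to surface a proper error. A minimal sketch of a shared helper following the `transpose()` pattern (the helper name is made up for illustration):

```rust
use std::collections::HashMap;

/// Parses an optional numeric setting from the CLI-provided map, returning an
/// error instead of panicking when the value is not a valid number.
fn parse_setting<T: std::str::FromStr>(
    settings: &HashMap<&str, &str>,
    key: &str,
) -> anyhow::Result<Option<T>>
where
    T::Err: std::error::Error + Send + Sync + 'static,
{
    settings
        .get(key)
        .map(|raw| raw.parse::<T>())
        .transpose()
        .map_err(anyhow::Error::new)
}
```

Usage would then look like `let gc_horizon: Option<u64> = parse_setting(&settings, "gc_horizon")?;`.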
pub fn timeline_list(&self, tenant_id: &ZTenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self
.http_request(


@@ -7,8 +7,8 @@
- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [walkeeper/README](/walkeeper/README) — WAL service overview.
- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core


@@ -27,4 +27,4 @@ management_token = jwt.encode({"scope": "pageserverapi"}, auth_keys.priv, algori
tenant_token = jwt.encode({"scope": "tenant", "tenant_id": ps.initial_tenant}, auth_keys.priv, algorithm="RS256")
```
Utility functions to work with jwts in rust are located in zenith_utils/src/auth.rs
Utility functions to work with jwts in rust are located in libs/utils/src/auth.rs


@@ -21,7 +21,7 @@ NOTE:It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create branch at certain LSN using `zenith branch` command.
We can create branch at certain LSN using `zenith timeline branch` command.
Each Branch lives in a corresponding timeline[] and has an ancestor[].
@@ -29,24 +29,32 @@ Each Branch lives in a corresponding timeline[] and has an ancestor[].
NOTE: This is an overloaded term.
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
### Checkpoint (Layered repository)
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server []
writes out the changes from in-memory layers into new layer files[]. This process
is called "checkpointing". The page server only creates layer files for
relations that have been modified since the last checkpoint.
writes out the changes from the in-memory layer into a new delta layer file. This process
is called "checkpointing".
Configuration parameter `checkpoint_distance` defines the distance
from current LSN to perform checkpoint of in-memory layers.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
Set this parameter to `0` to force checkpoint of every layer.
Configuration parameter `checkpoint_period` defines the interval between checkpoint iterations.
Default is `DEFAULT_CHECKPOINT_PERIOD`.
### Compaction
A background operation on layer files. Compaction takes a number of L0
layer files, each of which covers the whole key space and a range of
LSN, and reshuffles the data in them into L1 files so that each file
covers the whole LSN range, but only part of the key space.
Compaction should also opportunistically leave out obsolete page versions
from the L1 files, and materialize other page versions for faster
access. That hasn't been implemented as of this writing, though.
### Compute node
Stateless Postgres node that stores data in pageserver.
@@ -54,10 +62,10 @@ Stateless Postgres node that stores data in pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
Each PostgreSQL fork is considered a separate relish.
### Layer
@@ -72,15 +80,15 @@ are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". Each file corresponds to one RELISH_SEG_SIZE
segment of a PostgreSQL relation fork. There are two kinds of layer
files: image files and delta files. An image file contains a
"snapshot" of the segment at a particular LSN, and a delta file
contains WAL records applicable to the segment, in a range of LSNs.
files are called "layer files". There are two kinds of layer files:
image files and delta files. An image file contains a "snapshot" of a
range of keys at a particular LSN, and a delta file contains WAL
records applicable to a range of keys, in a range of LSNs.
### Layer map
The layer map tracks what layers exist for all the relishes in a timeline.
The layer map tracks what layers exist in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
@@ -100,10 +108,10 @@ PostgreSQL LSNs and functions to monitor them:
* `pg_current_wal_lsn()` - Returns the current write-ahead log write location.
* `pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
* `pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
[source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html):
Zenith safekeeper LSNs. For more check [walkeeper/README_PROTO.md](/walkeeper/README_PROTO.md)
Zenith safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md)
* `CommitLSN`: position in WAL confirmed by quorum safekeepers.
* `RestartLSN`: position in WAL confirmed by all safekeepers.
* `FlushLSN`: part of WAL persisted to the disk by safekeeper.
@@ -149,14 +157,6 @@ and create new databases and accounts (control plane API in our case).
The generic term in PostgreSQL for all objects in a database that have a name and a list of attributes defined in a specific order.
### Relish
We call each relation and other file that is stored in the
repository a "relish". It comes from "rel"-ish, as in "kind of a
rel", because it covers relations as well as other things that are
not relations, but are treated similarly for the purposes of the
storage layer.
### Replication slot
@@ -173,33 +173,24 @@ One repository corresponds to one Tenant.
How much history do we need to keep around for PITR and read-only nodes?
### Segment (PostgreSQL)
NOTE: This is an overloaded term.
### Segment
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### Segment (Layered Repository)
NOTE: This is an overloaded term.
Segment is a RELISH_SEG_SIZE slice of relish (identified by a SegmentTag).
### SLRU
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in zenith yet (pg_commit_ts).
Each SLRU segment is considered a separate relish[].
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
Wal redo[] activity, timelines[], layers[] are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper
See `docs/multitenancy.md` for more.


@@ -12,7 +12,7 @@ Init empty pageserver using `initdb` in temporary directory.
`--storage_dest=FILE_PREFIX | S3_PREFIX |...` option defines object storage type, all other parameters are passed via env variables. Inspired by WAL-G style naming : https://wal-g.readthedocs.io/STORAGES/.
Save `storage_dest` and other parameters in config.
Push snapshots to `storage_dest` in background.
```
@@ -21,7 +21,7 @@ zenith start
```
#### 2. Restart pageserver (manually or crash-recovery).
Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`.
Push snapshots to `storage_dest` in background.
```
@@ -32,7 +32,7 @@ zenith start
Start pageserver from existing snapshot.
Path to snapshot provided via `--snapshot_path=FILE_PREFIX | S3_PREFIX | ...`
Do not save `snapshot_path` and `snapshot_format` in config, as it is a one-time operation.
Save `storage_dest` parameters in config.
Push snapshots to `storage_dest` in background.
```
//I.e. we want to start zenith on top of existing $PGDATA and use s3 as a persistent storage.
@@ -42,15 +42,15 @@ zenith start
How to pass credentials needed for `snapshot_path`?
#### 4. Export.
Manually push snapshot to `snapshot_path` which differs from `storage_dest`
Optionally set `snapshot_format`, which can be plain pgdata format or zenith format.
```
zenith export --snapshot_path=FILE_PREFIX --snapshot_format=pgdata
```
#### Notes and questions
- walkeeper s3_offload should use same (similar) syntax for storage. How to set it in UI?
- safekeeper s3_offload should use same (similar) syntax for storage. How to set it in UI?
- Why do we need `zenith init` as a separate command? Can't we init everything at first start?
- We can think of better names for all options.
- Export to plain postgres format will be useless, if we are not 100% compatible on page level.
I can recall at least one such difference - PD_WAL_LOGGED flag in pages.


@@ -0,0 +1,69 @@
# Safekeeper gossip
Extracted from this [PR](https://github.com/zenithdb/rfcs/pull/13)
## Motivation
In some situations, safekeeper (SK) needs coordination with other SK's that serve the same tenant:
1. WAL deletion. SK needs to know what WAL was already safely replicated to delete it. Now we keep WAL indefinitely.
2. Deciding on who is sending WAL to the pageserver. Now a crash of the sending SK may lead to a livelock where nobody sends WAL to the pageserver.
3. To enable SK to SK direct recovery without involving the compute
## Summary
The compute node has connection strings to each safekeeper. During each compute->safekeeper connection establishment, the compute node should pass down all those connection strings to each safekeeper. With that info, safekeepers may establish Postgres connections to each other and periodically send ping messages with an LSN payload.
## Components
safekeeper, compute, compute<->safekeeper protocol, possibly console (group SK addresses)
## Proposed implementation
Each safekeeper can periodically ping all its peers and share connectivity and liveness info. If a ping has not been received for, let's say, four ping periods, we may consider the sending safekeeper dead. That would mean one of the alive safekeepers should connect to the pageserver. One way to decide which one exactly: `make_connection = my_node_id == min(alive_nodes)`
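A minimal Rust sketch of that rule, assuming node ids are plain integers and that `alive_nodes` already includes the local node (both assumptions, not part of the RFC):

```rust
/// Toy version of `make_connection = my_node_id == min(alive_nodes)`:
/// the node with the smallest id among the nodes considered alive
/// elects itself to stream WAL to the pageserver.
fn should_connect(my_node_id: u64, alive_nodes: &[u64]) -> bool {
    alive_nodes
        .iter()
        .copied()
        .min()
        .map_or(false, |smallest| smallest == my_node_id)
}
```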
Since safekeepers are multi-tenant, we may establish either per-tenant physical connections or per-safekeeper ones. So it makes sense to group "logical" connections between corresponding tenants on different nodes into a single physical connection. That means that we should implement an interconnect thread that maintains physical connections and periodically broadcasts info about all tenants.
Right now the console may assign any 3 SK addresses to a given compute node. That may lead to a high number of gossip connections between SK's. Instead, we can assign safekeeper triples to the compute node. But if we want to "break"/"change" a group with an ad-hoc action, we can do it.
### Corner cases
- Current safekeeper may be alive but may not have connectivity to the pageserver
To address that, we need to gossip visibility info. Based on that info, we may define SK as alive only when it can connect to the pageserver.
- Current safekeeper may be alive but may not have connectivity with the compute node.
We may broadcast last_received_lsn and presence of compute connection and decide who is alive based on that.
- It is tricky to decide when to shut down gossip connections because we need to be sure that the pageserver got all the committed (in the distributed sense, so local SK info is not enough) records, and it may never lose them. It is not a strict requirement since the `--sync-safekeepers` run that happens before the compute start will allow the pageserver to consume missing WAL, but it is better to do that in the background. So the condition may look like this: `majority_max(flush_lsn) == pageserver_s3_lsn` Here we rely on two facts:
- that `--sync-safekeepers` happened after the compute shutdown, and it advanced local commit_lsn's allowing pageserver to consume that WAL.
- we wait for the `pageserver_s3_lsn` advancement to avoid pageserver's last_received_lsn/disk_consistent_lsn going backward due to the disk/hardware failure and subsequent S3 recovery
If those conditions are not met, we will have some gossip activity (but that may be okay).
## Pros/cons
Pros:
- distributed, does not introduce new services (like etcd), does not add console as a storage dependency
- lays the foundation for gossip-based recovery
Cons:
- Only the compute knows the set of safekeepers, but they should communicate even without a compute node. In case of a safekeeper restart, we will lose that info and can't gossip anymore. Hence we can't trim some WAL tail until the compute node starts. Also, it is ugly.
- If the console assigns a random set of safekeepers to each Postgres, we may end up in a situation where each safekeeper needs to have a connection with all other safekeepers. We can group safekeepers into isolated triples in the console to avoid that. Then "mixing" would happen only if we do rebalancing.
## Alternative implementation
We can have a selected node (e.g., console) with everybody reporting to it.
## Security implications
We don't increase the attack surface here. Communication can happen in a private network that is not exposed to users.
## Scalability implications
The only thing that may grow as we grow the number of computes is the number of gossip connections. But if we group safekeepers and assign a compute node to the random SK triple, the number of connections would be constant.


@@ -0,0 +1,145 @@
# Why LSM trees?
In general, an LSM tree has the nice property that random updates are
fast, but the disk writes are sequential. When a new file is created,
it is immutable. New files are created and old ones are deleted, but
existing files are never modified. That fits well with storing the
files on S3.
Currently, we create a lot of small files. That is mostly a problem
with S3, because each GET/PUT operation is expensive, and LIST
operation only returns 1000 objects at a time, and isn't free
either. Currently, the files are "archived" together into larger
checkpoint files before they're uploaded to S3 to alleviate that
problem, but garbage collecting data from the archive files would be
difficult and we have not implemented it. This proposal addresses that
problem.
# Overview
```
^ LSN
|
| Memtable: +-----------------------------+
| | |
| +-----------------------------+
|
|
| L0: +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
| +-----------------------------+
| | |
| +-----------------------------+
|
|
| L1: +-------+ +-----+ +--+ +-+
| | | | | | | | |
| | | | | | | | |
| +-------+ +-----+ +--+ +-+
|
| +----+ +-----+ +--+ +----+
| | | | | | | | |
| | | | | | | | |
| +----+ +-----+ +--+ +----+
|
+--------------------------------------------------------------> Page ID
+---+
| | Layer file
+---+
```
# Memtable
When new WAL arrives, it is first put into the Memtable. Despite the
name, the Memtable is not a purely in-memory data structure. It can
spill to a temporary file on disk if the system is low on memory, and
is accessed through a buffer cache.
If the page server crashes, the Memtable is lost. It is rebuilt by
processing again the WAL that's newer than the latest layer in L0.
The size of the Memtable is configured by the "checkpoint distance"
setting. Because anything that hasn't been flushed to disk and
uploaded to S3 yet needs to be kept in the safekeeper, the "checkpoint
distance" also determines the amount of WAL that needs to kept in the
safekeeper.
# L0
When the Memtable fills up, it is written out to a new file in L0. The
files are immutable; when a file is created, it is never
modified. Each file in L0 is roughly 1 GB in size (*). Like the
Memtable, each file in L0 covers the whole key range.
When enough files have been accumulated in L0, compaction
starts. Compaction processes all the files in L0 and reshuffles the
data to create a new set of files in L1.
(*) except in corner cases like if we want to shut down the page
server and want to flush out the memtable to disk even though it's not
full yet.
# L1
L1 consists of ~ 1 GB files like L0. But each file covers only part of
the overall key space, and a larger range of LSNs. This speeds up
searches. When you're looking for a given page, you need to check all
the files in L0, to see if they contain a page version for the requested
page. But in L1, you only need to check the files whose key range covers
the requested page. This is particularly important at cold start, when
checking a file means downloading it from S3.
Partitioning by key range also helps with garbage collection. If only a
part of the database is updated, we will accumulate more files for
the hot part in L1, and old files can be removed without affecting the
cold part.
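A toy Rust sketch of the lookup rule described above; the layer struct, key type, and level tag are illustrative assumptions, not the pageserver's real data structures:

```rust
use std::ops::Range;

/// Simplified layer descriptor: L0 layers span the whole key space,
/// L1 layers only a slice of it.
struct Layer {
    level: u8, // 0 or 1
    key_range: Range<u64>,
    lsn_range: Range<u64>,
}

/// Layers that must be consulted when reading `key` at `lsn`:
/// every L0 layer below the request LSN, but only the L1 layers
/// whose key range actually covers the key.
fn candidate_layers<'a>(layers: &'a [Layer], key: u64, lsn: u64) -> Vec<&'a Layer> {
    layers
        .iter()
        .filter(|l| l.lsn_range.start < lsn)
        .filter(|l| l.level == 0 || l.key_range.contains(&key))
        .collect()
}
```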
# Image layers
So far, we've only talked about delta layers. In addition to the delta
layers, we create image layers, when "enough" WAL has been accumulated
for some part of the database. Each image layer covers a 1 GB range of
key space. It contains images of the pages at a single LSN, a snapshot
if you will.
The exact heuristic for what "enough" means is not clear yet. Maybe
create a new image layer when 10 GB of WAL has been accumulated for a
1 GB segment.
The image layers limit the number of layers that a search needs to
check. That put a cap on read latency, and it also allows garbage
collecting layers that are older than the GC horizon.
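As a rough illustration of that heuristic (the 10 GB threshold is only the example number above, not a decided policy), the trigger could be as simple as:

```rust
/// Hypothetical trigger: create a new image layer for a ~1 GB key range
/// once the WAL accumulated for it exceeds the (example) 10 GB threshold.
fn should_create_image_layer(accumulated_wal_bytes: u64) -> bool {
    const THRESHOLD_BYTES: u64 = 10 * (1 << 30);
    accumulated_wal_bytes >= THRESHOLD_BYTES
}
```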
# Partitioning scheme
When compaction happens and creates a new set of files in L1, how do
we partition the data into the files?
- Goal is that each file is ~ 1 GB in size
- Try to match partition boundaries at relation boundaries. (See [1]
for how PebblesDB does this, and for why that's important)
- Greedy algorithm (a toy sketch follows below)
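A toy greedy partitioner in that spirit; the chunk sizes and the target size are made-up inputs, and real compaction would also try to respect relation boundaries as noted above:

```rust
/// Walk key-space chunks in order and start a new L1 file whenever adding the
/// next chunk would push the current file past the target size.
fn partition(chunk_sizes: &[u64], target_file_size: u64) -> Vec<Vec<usize>> {
    let mut files: Vec<Vec<usize>> = vec![Vec::new()];
    let mut current_size = 0u64;
    for (idx, &size) in chunk_sizes.iter().enumerate() {
        if current_size + size > target_file_size && !files.last().unwrap().is_empty() {
            files.push(Vec::new());
            current_size = 0;
        }
        files.last_mut().unwrap().push(idx);
        current_size += size;
    }
    files
}
```

With `target_file_size = 1 << 30` this yields roughly 1 GB files, matching the first goal above.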
# Additional Reading
[1] Paper on PebblesDB and how it does partitioning.
https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf


@@ -0,0 +1,295 @@
# Storage messaging
Created on 19.01.22
Initially created [here](https://github.com/zenithdb/rfcs/pull/16) by @kelvich.
It is an alternative to 014-safekeeper-gossip.
## Motivation
As in 014-safekeeper-gossip we need to solve the following problems:
* Trim WAL on safekeepers
* Decide on which SK should push WAL to the S3
* Decide on which SK should forward WAL to the pageserver
* Decide on when to shut down SK<->pageserver connection
This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery but rather unties two sets of different issues we previously wanted to solve with gossip.
Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper.
## Summary
Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a `--broker-url=1.2.3.4` CLI param.
Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a Grpc client in our codebase either directly or as an etcd dependency.
## Non-goals
This RFC does *not* suggest moving the compute-to-pageserver and compute-to-safekeeper mappings out of the console. The console is still the only place in the cluster responsible for the persistency of that info. So I'm implying that each pageserver and safekeeper knows exactly which timelines it serves, as is currently the case. We need some mechanism for a new pageserver to discover mapping info, but that is out of the scope of this RFC.
## Impacted components
pageserver, safekeeper
adds either etcd or console as a storage dependency
## Possible implementation: custom message broker in the console
We've decided to go with an etcd approach instead of the message broker.
<details closed>
<summary>Original suggestion</summary>
<br>
We can add a Grpc service in the console that acts as a message broker since the console knows the addresses of all the components. The broker can ignore the payload and only redirect messages. So, for example, each safekeeper may send a message to the peering safekeepers or to the pageserver responsible for a given timeline.
Message format could be `{sender, destination, payload}`.
The destination is either:
1. `sk_#{tenant}_#{timeline}` -- to be broadcasted on all safekeepers, responsible for that timeline, or
2. `pserver_#{tenant}_#{timeline}` -- to be broadcasted on all pageservers, responsible for that timeline
Sender is either:
1. `sk_#{sk_id}`, or
2. `pserver_#{pserver_id}`
I can think of the following behavior to address our original problems:
* WAL trimming
Each safekeeper periodically broadcasts `(write_lsn, commit_lsn)` to all peering (peering == responsible for that timeline) safekeepers
* Decide on which SK should push WAL to the S3
Each safekeeper periodically broadcasts an `i_am_alive_#{current_timestamp}` message to all peering safekeepers. That way, safekeepers may maintain a vector of alive peers (a loose one, with false negatives). The alive safekeeper with the minimal id pushes data to S3.
* Decide on which SK should forward WAL to the pageserver
Each safekeeper periodically sends (write_lsn, commit_lsn, compute_connected) to the relevant pageservers. With that info, the pageserver can maintain a view of the safekeepers' state, connect to a random one, and detect the moments (e.g., one of the safekeepers is not making progress or is down) when it needs to reconnect to another safekeeper. The pageserver should resolve exact IP addresses through the console, e.g., exchange `#sk_#{sk_id}` for `4.5.6.7:6400`.
Pageserver connection to the safekeeper triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other addresses until the next compute connection).
* Decide on when to shutdown sk<->pageserver connection
Again, pageserver would have all the info to understand when to shut down the safekeeper connection.
### Scalability
One node is enough (c) No, seriously, it is enough.
### High Availability
Broker lives in the console, so we can rely on k8s maintaining the console app alive.
If the console is down, we won't trim WAL or reconnect the pageserver to another safekeeper. But, at the same time, if the console is down, we already can't accept new compute connections or start stopped computes, so we are making things a bit worse, but not dramatically.
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | Console broker | ...
sk_n <-> |________________| <-> pserver_m
```
</details>
## Implementation: etcd state store
Alternatively, we can set up `etcd` and maintain the following data structure in it:
```ruby
"compute_#{tenant}_#{timeline}" => {
safekeepers => {
"sk_#{sk_id}" => {
write_lsn: "0/AEDF130",
commit_lsn: "0/AEDF100",
compute_connected: true,
last_updated: 1642621138,
},
}
}
```
As etcd doesn't support field updates in nested objects, that translates to the following set of keys:
```ruby
"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/write_lsn",
"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/commit_lsn",
...
```
Each storage node can subscribe to the relevant sets of keys and maintain a local view of that structure. So in terms of data flow, everything is the same as in the previous approach. Still, we avoid implementing the message broker and prevent a runtime storage dependency on the console.
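A minimal sketch of that subscription, assuming the `etcd-client` crate and an etcd endpoint on `localhost:2379` (both are illustrative choices, not something this RFC prescribes): the node watches the key prefix for one timeline and mirrors every update into an in-memory map.
```rust
use std::collections::HashMap;

use etcd_client::{Client, EventType, WatchOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Hypothetical prefix for a single timeline; the real key layout is described above.
    let prefix = "compute_<tenant>_<timeline>/safekeepers/";
    let (_watcher, mut stream) = client
        .watch(prefix, Some(WatchOptions::new().with_prefix()))
        .await?;

    // Local view: full key -> latest value, e.g. ".../sk_1/commit_lsn" -> "0/AEDF100".
    let mut local_view: HashMap<String, String> = HashMap::new();

    while let Some(resp) = stream.message().await? {
        for event in resp.events() {
            if let Some(kv) = event.kv() {
                match event.event_type() {
                    EventType::Put => {
                        local_view.insert(kv.key_str()?.to_string(), kv.value_str()?.to_string());
                    }
                    EventType::Delete => {
                        local_view.remove(kv.key_str()?);
                    }
                }
            }
        }
        // A real node would re-evaluate its decisions here based on `local_view`.
    }
    Ok(())
}
```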
### Safekeeper address discovery
During startup, the safekeeper should publish the address it is listening on as part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. The safekeeper should have an `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertise something more useful.
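For the publishing side, a corresponding sketch (again assuming the `etcd-client` crate; the key name and the `--advertised-address` flag follow the convention above and are otherwise hypothetical):
```rust
use etcd_client::Client;

/// Publish the advertised address under "sk_#{sk_id}" so that a pageserver
/// can later resolve the safekeeper id to a reachable host:port.
async fn publish_advertised_address(
    client: &mut Client,
    sk_id: u64,
    advertised_address: &str, // value of the --advertised-address CLI option
) -> Result<(), etcd_client::Error> {
    let key = format!("sk_{}", sk_id);
    client.put(key, advertised_address, None).await?;
    Ok(())
}
```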
### Safekeeper behavior
For each timeline, the safekeeper periodically broadcasts the `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way the safekeeper has information about its peering safekeepers.
That amount of information is enough to properly trim WAL. To decide who pushes the data to S3, a safekeeper may use etcd leases or broadcast a timestamp and thereby track who is alive.
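To make those two decisions concrete, here is a standalone sketch (the types, field names, and thresholds are illustrative, not the actual safekeeper code): WAL can be trimmed up to the minimum position that every known peer has already reached, and the alive safekeeper with the minimal id takes on the S3-push duty.
```rust
use std::collections::HashMap;

/// Per-peer state mirrored from the metadata store (illustrative fields only).
struct PeerInfo {
    commit_lsn: u64,
    last_updated: u64, // unix seconds of the peer's last broadcast
}

/// WAL below this LSN is safe to remove: every known peer already has it.
fn wal_trim_horizon(peers: &HashMap<u64, PeerInfo>) -> Option<u64> {
    peers.values().map(|p| p.commit_lsn).min()
}

/// The alive safekeeper with the minimal id pushes WAL to S3.
fn should_push_to_s3(my_id: u64, peers: &HashMap<u64, PeerInfo>, now: u64, ttl: u64) -> bool {
    let min_alive_id = peers
        .iter()
        .filter(|(_, p)| now.saturating_sub(p.last_updated) <= ttl)
        .map(|(id, _)| *id)
        .min();
    // If nobody else looks alive, assume the duty ourselves.
    min_alive_id.map_or(true, |id| my_id <= id)
}

fn main() {
    let mut peers = HashMap::new();
    peers.insert(1, PeerInfo { commit_lsn: 180, last_updated: 1_000 });
    peers.insert(2, PeerInfo { commit_lsn: 150, last_updated: 1_000 });
    peers.insert(3, PeerInfo { commit_lsn: 100, last_updated: 900 }); // stale entry

    assert_eq!(wal_trim_horizon(&peers), Some(100));
    assert!(should_push_to_s3(1, &peers, 1_005, 30));
    assert!(!should_push_to_s3(2, &peers, 1_005, 30));
}
```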
### Pageserver behavior
The pageserver subscribes to `compute_#{tenant}_#{timeline}` for each tenant it owns. With that info, the pageserver can maintain a view of the safekeepers' state, connect to a random one, and detect the moments when it needs to reconnect to another safekeeper (e.g., one of the safekeepers is not making progress or is down). The pageserver should resolve exact IP addresses through the console, e.g., exchange `sk_#{sk_id}` for `4.5.6.7:6400`.
A pageserver connection to the safekeeper can be triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore.
As an alternative to `compute_connected`, we can track the timestamp of the latest message that arrived at the safekeeper from the compute. Usually the compute broadcasts a KeepAlive to all safekeepers every second, so the timestamp is refreshed every second while the connection is OK. The connection can then be considered down when this timestamp hasn't been updated for several seconds.
This helps to detect issues with a safekeeper faster (and switch to another one) in the following cases:
1. when the compute failed but the TCP connection stays alive until the timeout (usually about a minute)
2. when the safekeeper failed and didn't set `compute_connected` to false
Another way to deal with [2] is to treat `(write_lsn, commit_lsn, compute_connected)` as a KeepAlive on the pageserver side and detect issues when a safekeeper doesn't send anything for some time. That approach is fully compliant with this RFC.
Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other's addresses until the next compute connection).
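A sketch of the pageserver-side reconnect decision under the keep-alive interpretation above (again with illustrative types and thresholds): only safekeepers with a fresh metadata entry qualify as replication sources, and among those the one with the most advanced `commit_lsn` wins.
```rust
use std::collections::HashMap;

/// Safekeeper state as seen by the pageserver via the metadata store (illustrative).
struct SkState {
    commit_lsn: u64,
    compute_connected: bool,
    last_updated: u64, // unix seconds of the last update from this safekeeper
}

/// Pick a replication source: a fresh, compute-connected entry with the highest commit_lsn.
/// Returns None if nobody qualifies, in which case the pageserver keeps waiting.
fn choose_safekeeper(view: &HashMap<u64, SkState>, now: u64, stale_after: u64) -> Option<u64> {
    view.iter()
        .filter(|(_, s)| s.compute_connected)
        .filter(|(_, s)| now.saturating_sub(s.last_updated) <= stale_after)
        .max_by_key(|(_, s)| s.commit_lsn)
        .map(|(id, _)| *id)
}

fn main() {
    let mut view = HashMap::new();
    view.insert(1, SkState { commit_lsn: 200, compute_connected: true, last_updated: 950 }); // stale
    view.insert(2, SkState { commit_lsn: 150, compute_connected: true, last_updated: 1_000 });
    view.insert(3, SkState { commit_lsn: 100, compute_connected: true, last_updated: 1_000 });

    // SK1 is ahead but has not reported for too long, so SK2 is chosen.
    assert_eq!(choose_safekeeper(&view, 1_005, 30), Some(2));
}
```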
### Interactions
```
.________________.
sk_1 <-> | | <-> pserver_1
... | etcd | ...
sk_n <-> |________________| <-> pserver_m
```
### Sequence diagrams for different workflows
#### Cluster startup
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
PS1->>M: subscribe to updates to state of timeline N
C->>+SK1: WAL push
loop constantly update current lsns
SK1->>-M: I'm at lsn A
end
C->>+SK2: WAL push
loop constantly update current lsns
SK2->>-M: I'm at lsn B
end
C->>+SK3: WAL push
loop constantly update current lsns
SK3->>-M: I'm at lsn C
end
loop request pages
C->>+PS1: get_page@lsn
PS1->>-C: page image
end
M->>PS1: New compute appeared for timeline N. SK1 at A, SK2 at B, SK3 at C
note over PS1: Say SK1 at A=200, SK2 at B=150, SK3 at C=100 <br> so connect to SK1 because it is the most up-to-date one
PS1->>SK1: start replication
```
#### Behaviour of services during typical operations
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Scenario 1: Pageserver checkpoint
note over PS1: Upload data to S3
PS1->>M: Update remote consistent lsn
M->>SK1: propagate remote consistent lsn update
note over SK1: truncate WAL up to remote consistent lsn
M->>SK2: propagate remote consistent lsn update
note over SK2: truncate WAL up to remote consistent lsn
M->>SK3: propagate remote consistent lsn update
note over SK3: truncate WAL up to remote consistent lsn
note over C,M: Scenario 2: SK1 finds itself lagging behind MAX(150 (SK2), 200 (SK2)) - 100 (SK1) > THRESHOLD
SK1->>SK2: Fetch WAL delta between 100 (SK1) and 200 (SK2)
note over C,M: Scenario 3: PS1 detects that SK1 is lagging behind: the connection from SK1 is broken or there are no messages from it for 30 seconds.
note over PS1: e.g. SK2 is at 150, SK3 is at 100, choose SK2 as the new replication source
PS1->>SK2: start replication
```
#### Behaviour during timeline relocation
```mermaid
sequenceDiagram
autonumber
participant C as Compute
participant SK1
participant SK2
participant SK3
participant PS1
participant PS2
participant O as Orchestrator
participant M as Metadata Service
note over C,M: Timeline is being relocated from PS1 to PS2
O->>+PS2: Attach timeline
PS2->>-O: 202 Accepted if timeline exists in S3
note over PS2: Download timeline from S3
note over O: Poll for timeline download (or subscribe to metadata service)
loop wait for attach to complete
O->>PS2: timeline detail should answer that timeline is ready
end
PS2->>M: Register downloaded timeline
PS2->>M: Get safekeepers for timeline, subscribe to changes
PS2->>SK1: Start replication to catch up
note over O: PS2 caught up, time to switch compute
O->>C: Restart compute with new pageserver url in config
note over C: WAL push is restarted
loop request pages
C->>+PS2: get_page@lsn
PS2->>-C: page image
end
O->>PS1: detach timeline
note over C,M: Scenario 1: Attach call failed
O--xPS2: Attach timeline
note over O: The operation can be safely retried, <br> if we hit some threshold we can try another pageserver
note over C,M: Scenario 2: Attach succeeded but pageserver failed to download the data or start replication
loop wait for attach to complete
O--xPS2: timeline detail should answer that timeline is ready
end
note over O: Can wait for a timeout, and then try another pageserver <br> there should be a limit on number of different pageservers to try
note over C,M: Scenario 3: Detach fails
O--xPS1: Detach timeline
note over O: can be retried; if it continues to fail, it might lead to data duplication in S3
```
# Pros/cons
## Console broker/etcd vs gossip:
Gossip pros:
* gossip allows running storage without the console or etcd
Console broker/etcd pros:
* simpler
* solves "call me maybe" as well
* avoid possible N-to-N connection issues with gossip without grouping safekeepers in pre-defined triples
## Console broker vs. etcd:
Initially, I wanted to avoid etcd as a dependency, mostly because I've seen how painful the ZooKeeper dependency was for ClickHouse: in each chat, at each conference, people were complaining about configuration and maintenance barriers with ZooKeeper. It was so bad that ClickHouse re-implemented ZooKeeper to embed it: https://clickhouse.com/docs/en/operations/clickhouse-keeper/.
But with etcd we are in a somewhat different situation:
1. We don't need persistence and strong consistency guarantees for the data we store in etcd
2. etcd uses gRPC as a protocol, and the messages are pretty simple
So it looks like implementing an in-memory store with an etcd interface is straightforward _if we want that in the future_. At the same time, we can avoid implementing it right now, and we will be able to run a local zenith installation with etcd running somewhere in the background (as opposed to building and running the console, which in turn requires Postgres).

View File

@@ -0,0 +1,79 @@
Cluster size limits
==================
## Summary
One of the resource consumption limits for free-tier users is a cluster size limit.
To enforce it, we need to calculate the timeline size and check if the limit is reached before relation create/extend operations.
If the limit is reached, the query must fail with some meaningful error/warning.
We may want to exempt some operations from the quota so that users can free space and fit back into the limit.
The stateless compute node that performs validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components.
## Motivation
Limit the maximum size of a PostgreSQL instance to limit free tier users (and other tiers in the future).
First of all, this is needed to control our free tier production costs.
Another reason to limit resources is risk management — we haven't (fully) tested and optimized zenith for big clusters,
so we don't want to give users access to the functionality that we don't think is ready.
## Components
* pageserver - calculate the size consumed by a timeline and add it to the feedback message.
* safekeeper - pass feedback message from pageserver to compute.
* compute - receive feedback message, enforce size limit based on GUC `zenith.max_cluster_size`.
* console - set and update `zenith.max_cluster_size` setting
## Proposed implementation
First of all, it's necessary to define timeline size.
The current approach is to count all data, including SLRUs (but not WAL).
Here we think of it as a physical disk underneath the Postgres cluster.
This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver.
Alternatively, we could count only relation data, as in `pg_database_size()`.
This approach is somewhat more user-friendly because it counts only the data the user actually affects.
On the other hand, it puts us in a weaker position than other services, e.g., RDS.
We will need to refactor the timeline_size counter or add another counter to implement it.
Timeline size is updated during WAL digestion. It is not versioned and is valid at the `last_received_lsn` moment.
Then this size should be reported to the compute node.
The `current_timeline_size` value is included in the walreceiver's custom feedback message, `ZenithFeedback`.
(PR about protocol changes https://github.com/zenithdb/zenith/pull/1037).
This message is received by the safekeeper and propagated to the compute node as part of `AppendResponse`.
Finally, when the compute node receives `current_timeline_size` from the safekeeper (or from the pageserver directly), it updates a global variable.
Then every `zenith_extend()` operation checks whether the limit is reached (`current_timeline_size > zenith.max_cluster_size`) and throws an `ERRCODE_DISK_FULL` error if so.
(see Postgres error codes [https://www.postgresql.org/docs/devel/errcodes-appendix.html](https://www.postgresql.org/docs/devel/errcodes-appendix.html))
TODO:
We can allow autovacuum processes to bypass this check by simply checking `IsAutoVacuumWorkerProcess()`.
It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check as well, but it is not easy to distinguish these operations at that low level.
See issues https://github.com/neondatabase/neon/issues/1245
https://github.com/zenithdb/zenith/issues/1445
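The enforcement itself is a single comparison. A small Rust sketch of the intended logic, including the autovacuum bypass from the TODO above (the real compute-side code is a Postgres extension in C, so this is only an illustration of the rule, not its implementation):
```rust
/// Error corresponding to Postgres' ERRCODE_DISK_FULL.
#[derive(Debug)]
struct DiskFullError;

/// Check performed before every extend operation.
///
/// `current_timeline_size` is the latest value received via the feedback message;
/// `max_cluster_size` mirrors the `zenith.max_cluster_size` GUC (None = no limit).
/// Autovacuum workers bypass the check so that vacuuming can free space even
/// when the limit has already been exceeded.
fn check_cluster_size(
    current_timeline_size: u64,
    max_cluster_size: Option<u64>,
    is_autovacuum_worker: bool,
) -> Result<(), DiskFullError> {
    if is_autovacuum_worker {
        return Ok(());
    }
    match max_cluster_size {
        Some(limit) if current_timeline_size > limit => Err(DiskFullError),
        _ => Ok(()),
    }
}

fn main() {
    assert!(check_cluster_size(9 << 20, Some(10 << 20), false).is_ok());
    assert!(check_cluster_size(11 << 20, Some(10 << 20), false).is_err());
    assert!(check_cluster_size(11 << 20, Some(10 << 20), true).is_ok()); // autovacuum bypass
}
```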
TODO:
We should warn users if the limit is soon to be reached.
### **Reliability, failure modes and corner cases**
1. `current_timeline_size` is valid at the last LSN received and digested by the pageserver.
If the pageserver lags behind the compute node, `current_timeline_size` will lag too. This lag can be tuned using backpressure, but it is not expected to be 0 all the time.
So transactions that happen within this LSN range may push the timeline over the limit, especially operations that allocate (e.g., CREATE DATABASE) or free (e.g., TRUNCATE) a lot of data pages while generating a small amount of WAL. Are there other operations like this?
Currently, CREATE DATABASE operations are restricted in the console. So this is not an issue.
### **Security implications**
We treat compute as an untrusted component. That's why we try to isolate it with a secure container runtime or a VM.
Malicious users may change `zenith.max_cluster_size`, so we need an extra size limit check.
To cover this case, we also monitor the compute node size in the console.

View File

@@ -68,11 +68,15 @@ S3.
The unit is # of bytes.
#### checkpoint_period
#### compaction_period
The pageserver checks whether `checkpoint_distance` has been reached
every `checkpoint_period` seconds. Default is 1 s, which should be
fine.
Every `compaction_period` seconds, the page server checks if
maintenance operations, like compaction, are needed on the layer
files. Default is 1 s, which should be fine.
#### compaction_target_size
File sizes for L0 delta and L1 image layers. Default is 128MB.
#### gc_horizon
@@ -85,6 +89,14 @@ away.
Interval at which garbage collection is triggered. Default is 100 s.
#### image_creation_threshold
L0 delta layer threshold for L1 image layer creation. Default is 3.
#### pitr_interval
WAL retention duration for PITR branching. Default is 30 days.
#### initial_superuser_name
Name of the initial superuser role, passed to initdb when a new tenant
@@ -156,6 +168,9 @@ access_key_id = 'SOMEKEYAAAAASADSAH*#'
# Secret access key to connect to the bucket ("password" part of the credentials)
secret_access_key = 'SOMEsEcReTsd292v'
# S3 API query limit to avoid getting errors/throttling from AWS.
concurrency_limit = 100
```
###### General remote storage configuration
@@ -167,8 +182,8 @@ Besides, there are parameters common for all types of remote storage that can be
```toml
[remote_storage]
# Max number of concurrent connections to open for uploading to or downloading from the remote storage.
max_concurrent_sync = 100
# Max number of concurrent timeline synchronized (layers uploaded or downloaded) with the remote storage at the same time.
max_concurrent_timelines_sync = 50
# Max number of errors a single task can have before it's considered failed and not attempted to run anymore.
max_sync_errors = 10

View File

@@ -28,12 +28,7 @@ The pageserver has a few different duties:
- Receive WAL from the WAL service and decode it.
- Replay WAL that's applicable to the chunks that the Page Server maintains
For more detailed info, see `/pageserver/README`
`/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
For more detailed info, see [/pageserver/README](/pageserver/README.md)
`/proxy`:
@@ -57,29 +52,38 @@ PostgreSQL extension that implements storage manager API and network communicati
PostgreSQL extension that contains functions needed for testing and debugging.
`/walkeeper`:
`/safekeeper`:
The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see `/walkeeper/README`
For more detailed info, see [/safekeeper/README](/safekeeper/README.md)
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation.
`/zenith`
Main entry point for the 'zenith' CLI utility.
TODO: Doesn't it belong to control_plane?
`/zenith_metrics`:
`/libs`:
Unites granular neon helper crates under the hood.
`/libs/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
`/libs/utils`:
Generic helpers that are shared between other crates in this repository.
A subject for future modularization.
`/libs/metrics`:
Helpers for exposing Prometheus metrics from the server.
`/zenith_utils`:
Helpers that are shared between other crates in this repository.
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.

View File

@@ -1,5 +1,5 @@
[package]
name = "zenith_metrics"
name = "metrics"
version = "0.1.0"
edition = "2021"
@@ -8,3 +8,4 @@ prometheus = {version = "0.13", default_features=false} # removes protobuf depen
libc = "0.2"
lazy_static = "1.4"
once_cell = "1.8.0"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }

View File

@@ -8,8 +8,8 @@ use std::io::{Read, Result, Write};
///
/// ```
/// # use std::io::{Result, Read};
/// # use zenith_metrics::{register_int_counter, IntCounter};
/// # use zenith_metrics::CountedReader;
/// # use metrics::{register_int_counter, IntCounter};
/// # use metrics::CountedReader;
/// #
/// # lazy_static::lazy_static! {
/// # static ref INT_COUNTER: IntCounter = register_int_counter!(
@@ -83,8 +83,8 @@ impl<T: Read> Read for CountedReader<'_, T> {
///
/// ```
/// # use std::io::{Result, Write};
/// # use zenith_metrics::{register_int_counter, IntCounter};
/// # use zenith_metrics::CountedWriter;
/// # use metrics::{register_int_counter, IntCounter};
/// # use metrics::CountedWriter;
/// #
/// # lazy_static::lazy_static! {
/// # static ref INT_COUNTER: IntCounter = register_int_counter!(

View File

@@ -17,8 +17,8 @@ log = "0.4.14"
memoffset = "0.6.2"
thiserror = "1.0"
serde = { version = "1.0", features = ["derive"] }
workspace_hack = { path = "../workspace_hack" }
zenith_utils = { path = "../zenith_utils" }
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[build-dependencies]
bindgen = "0.59.1"

View File

@@ -88,8 +88,8 @@ fn main() {
// 'pg_config --includedir-server' would perhaps be the more proper way to find it,
// but this will do for now.
//
.clang_arg("-I../tmp_install/include/server")
.clang_arg("-I../tmp_install/include/postgresql/server")
.clang_arg("-I../../tmp_install/include/server")
.clang_arg("-I../../tmp_install/include/postgresql/server")
//
// Finish the builder and generate the bindings.
//

View File

@@ -43,7 +43,7 @@ impl ControlFileData {
/// Interpret a slice of bytes as a Postgres control file.
///
pub fn decode(buf: &[u8]) -> Result<ControlFileData> {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
// Check that the slice has the expected size. The control file is
// padded with zeros up to a 512 byte sector size, so accept a
@@ -77,7 +77,7 @@ impl ControlFileData {
///
/// The CRC is recomputed to match the contents of the fields.
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
// Serialize into a new buffer.
let b = self.ser().unwrap();

View File

@@ -24,6 +24,9 @@ pub const VISIBILITYMAP_FORKNUM: u8 = 2;
pub const INIT_FORKNUM: u8 = 3;
// From storage_xlog.h
pub const XLOG_SMGR_CREATE: u8 = 0x10;
pub const XLOG_SMGR_TRUNCATE: u8 = 0x20;
pub const SMGR_TRUNCATE_HEAP: u32 = 0x0001;
pub const SMGR_TRUNCATE_VM: u32 = 0x0002;
pub const SMGR_TRUNCATE_FSM: u32 = 0x0004;
@@ -113,7 +116,6 @@ pub const XACT_XINFO_HAS_TWOPHASE: u32 = 1u32 << 4;
// From pg_control.h and rmgrlist.h
pub const XLOG_NEXTOID: u8 = 0x30;
pub const XLOG_SWITCH: u8 = 0x40;
pub const XLOG_SMGR_TRUNCATE: u8 = 0x20;
pub const XLOG_FPI_FOR_HINT: u8 = 0xA0;
pub const XLOG_FPI: u8 = 0xB0;
pub const DB_SHUTDOWNED: u32 = 1;

View File

@@ -4,7 +4,7 @@
//! This understands the WAL page and record format, enough to figure out where the WAL record
//! boundaries are, and to reassemble WAL records that cross page boundaries.
//!
//! This functionality is needed by both the pageserver and the walkeepers. The pageserver needs
//! This functionality is needed by both the pageserver and the safekeepers. The pageserver needs
//! to look deeper into the WAL records to also understand which blocks they modify, the code
//! for that is in pageserver/src/walrecord.rs
//!
@@ -18,7 +18,7 @@ use crc32c::*;
use log::*;
use std::cmp::min;
use thiserror::Error;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
pub struct WalStreamDecoder {
lsn: Lsn,

View File

@@ -28,7 +28,7 @@ use std::io::prelude::*;
use std::io::SeekFrom;
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
pub const XLOG_FNAME_LEN: usize = 24;
pub const XLOG_BLCKSZ: usize = 8192;
@@ -351,17 +351,17 @@ pub fn main() {
impl XLogRecord {
pub fn from_slice(buf: &[u8]) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogRecord::des(buf).unwrap()
}
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogRecord::des_from(&mut buf.reader()).unwrap()
}
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
@@ -373,19 +373,19 @@ impl XLogRecord {
impl XLogPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogPageHeaderData {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogPageHeaderData::des_from(&mut buf.reader()).unwrap()
}
}
impl XLogLongPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogLongPageHeaderData {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogLongPageHeaderData::des_from(&mut buf.reader()).unwrap()
}
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
}
@@ -394,12 +394,12 @@ pub const SIZEOF_CHECKPOINT: usize = std::mem::size_of::<CheckPoint>();
impl CheckPoint {
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
pub fn decode(buf: &[u8]) -> Result<CheckPoint, anyhow::Error> {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
Ok(CheckPoint::des(buf)?)
}
@@ -477,7 +477,9 @@ mod tests {
#[test]
pub fn test_find_end_of_wal() {
// 1. Run initdb to generate some WAL
let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("..");
let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("..");
let data_dir = top_path.join("test_output/test_find_end_of_wal");
let initdb_path = top_path.join("tmp_install/bin/initdb");
let lib_path = top_path.join("tmp_install/lib");
@@ -495,7 +497,13 @@ mod tests {
.env("DYLD_LIBRARY_PATH", &lib_path)
.output()
.unwrap();
assert!(initdb_output.status.success());
assert!(
initdb_output.status.success(),
"initdb failed. Status: '{}', stdout: '{}', stderr: '{}'",
initdb_output.status,
String::from_utf8_lossy(&initdb_output.stdout),
String::from_utf8_lossy(&initdb_output.stderr),
);
// 2. Pick WAL generated by initdb
let wal_dir = data_dir.join("pg_wal");

View File

@@ -1,5 +1,5 @@
[package]
name = "zenith_utils"
name = "utils"
version = "0.1.0"
edition = "2021"
@@ -10,35 +10,35 @@ bytes = "1.0.1"
hyper = { version = "0.14.7", features = ["full"] }
lazy_static = "1.4.0"
pin-project-lite = "0.2.7"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
routerify = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
thiserror = "1.0"
tokio = { version = "1.11", features = ["macros"]}
tokio = { version = "1.17", features = ["macros"]}
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
nix = "0.23.0"
signal-hook = "0.3.10"
rand = "0.8.3"
jsonwebtoken = "7"
jsonwebtoken = "8"
hex = { version = "0.4.3", features = ["serde"] }
rustls = "0.19.1"
rustls-split = "0.2.1"
rustls = "0.20.2"
rustls-split = "0.3.0"
git-version = "0.3.5"
serde_with = "1.12.0"
zenith_metrics = { path = "../zenith_metrics" }
workspace_hack = { path = "../workspace_hack" }
metrics = { path = "../metrics" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[dev-dependencies]
byteorder = "1.4.3"
bytes = "1.0.1"
hex-literal = "0.3"
tempfile = "3.2"
webpki = "0.21"
criterion = "0.3"
rustls-pemfile = "0.2.1"
[[bench]]
name = "benchmarks"

View File

@@ -1,7 +1,7 @@
#![allow(unused)]
use criterion::{criterion_group, criterion_main, Criterion};
use zenith_utils::zid;
use utils::zid;
pub fn bench_zid_stringify(c: &mut Criterion) {
// Can only use public methods.

View File

@@ -0,0 +1,21 @@
#!/bin/bash
PG_BIN=$1
WAL_PATH=$2
DATA_DIR=$3
PORT=$4
SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-`
rm -fr $DATA_DIR
env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID
echo port=$PORT >> $DATA_DIR/postgresql.conf
REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-`
declare -i WAL_SIZE=$REDO_POS+114
$PG_BIN/pg_ctl -D $DATA_DIR -l logfile start
$PG_BIN/pg_ctl -D $DATA_DIR -l logfile stop -m immediate
cp $DATA_DIR/pg_wal/000000010000000000000001 .
cp $WAL_PATH/* $DATA_DIR/pg_wal/
if [ -f $DATA_DIR/pg_wal/*.partial ]
then
(cd $DATA_DIR/pg_wal ; for partial in \*.partial ; do mv $partial `basename $partial .partial` ; done)
fi
dd if=000000010000000000000001 of=$DATA_DIR/pg_wal/000000010000000000000001 bs=$WAL_SIZE count=1 conv=notrunc
rm -f 000000010000000000000001

View File

@@ -0,0 +1,20 @@
PG_BIN=$1
WAL_PATH=$2
DATA_DIR=$3
PORT=$4
SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-`
rm -fr $DATA_DIR /tmp/pg_wals
mkdir /tmp/pg_wals
env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID
echo port=$PORT >> $DATA_DIR/postgresql.conf
REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-`
declare -i WAL_SIZE=$REDO_POS+114
cp $WAL_PATH/* /tmp/pg_wals
if [ -f $DATA_DIR/pg_wal/*.partial ]
then
(cd /tmp/pg_wals ; for partial in \*.partial ; do mv $partial `basename $partial .partial` ; done)
fi
dd if=$DATA_DIR/pg_wal/000000010000000000000001 of=/tmp/pg_wals/000000010000000000000001 bs=$WAL_SIZE count=1 conv=notrunc
echo > $DATA_DIR/recovery.signal
rm -f $DATA_DIR/pg_wal/*
echo "restore_command = 'cp /tmp/pg_wals/%f %p'" >> $DATA_DIR/postgresql.conf

View File

@@ -5,7 +5,7 @@
/// For example, to calculate the smallest value among some integers:
///
/// ```
/// use zenith_utils::accum::Accum;
/// use utils::accum::Accum;
///
/// let values = [1, 2, 3];
///

View File

@@ -1,8 +1,6 @@
// For details about authentication see docs/authentication.md
// TODO there are two issues for our use case in jsonwebtoken library which will be resolved in next release
// The first one is that there is no way to disable expiration claim, but it can be excluded from validation, so use this as a workaround for now.
// Relevant issue: https://github.com/Keats/jsonwebtoken/issues/190
// The second one is that we wanted to use ed25519 keys, but they are also not supported until next version. So we go with RSA keys for now.
//
// TODO: use ed25519 keys
// Relevant issue: https://github.com/Keats/jsonwebtoken/issues/162
use serde;
@@ -59,19 +57,19 @@ pub fn check_permission(claims: &Claims, tenantid: Option<ZTenantId>) -> Result<
}
pub struct JwtAuth {
decoding_key: DecodingKey<'static>,
decoding_key: DecodingKey,
validation: Validation,
}
impl JwtAuth {
pub fn new(decoding_key: DecodingKey<'_>) -> Self {
pub fn new(decoding_key: DecodingKey) -> Self {
let mut validation = Validation::new(JWT_ALGORITHM);
// The default 'required_spec_claims' is 'exp'. But we don't want to require
// expiration.
validation.required_spec_claims = [].into();
Self {
decoding_key: decoding_key.into_static(),
validation: Validation {
algorithms: vec![JWT_ALGORITHM],
validate_exp: false,
..Default::default()
},
decoding_key,
validation,
}
}

View File

@@ -5,12 +5,11 @@ use anyhow::anyhow;
use hyper::header::AUTHORIZATION;
use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server};
use lazy_static::lazy_static;
use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder};
use routerify::ext::RequestExt;
use routerify::RequestInfo;
use routerify::{Middleware, Router, RouterBuilder, RouterService};
use tracing::info;
use zenith_metrics::{new_common_metric_name, register_int_counter, IntCounter};
use zenith_metrics::{Encoder, TextEncoder};
use std::future::Future;
use std::net::TcpListener;
@@ -36,7 +35,7 @@ async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body
let mut buffer = vec![];
let encoder = TextEncoder::new();
let metrics = zenith_metrics::gather();
let metrics = metrics::gather();
encoder.encode(&metrics, &mut buffer).unwrap();
let response = Response::builder()
@@ -160,7 +159,7 @@ pub fn serve_thread_main<S>(
where
S: Future<Output = ()> + Send + Sync,
{
info!("Starting a http endpoint at {}", listener.local_addr()?);
info!("Starting an HTTP endpoint at {}", listener.local_addr()?);
// Create a Service from the router above to handle incoming requests.
let service = RouterService::new(router_builder.build().map_err(|err| anyhow!(err))?).unwrap();

View File

@@ -17,6 +17,9 @@ pub enum ApiError {
#[error("NotFound: {0}")]
NotFound(String),
#[error("Conflict: {0}")]
Conflict(String),
#[error(transparent)]
InternalServerError(#[from] anyhow::Error),
}
@@ -42,6 +45,9 @@ impl ApiError {
ApiError::NotFound(_) => {
HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::NOT_FOUND)
}
ApiError::Conflict(_) => {
HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::CONFLICT)
}
ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
err.to_string(),
StatusCode::INTERNAL_SERVER_ERROR,

View File

@@ -10,8 +10,8 @@ pub async fn json_request<T: for<'de> Deserialize<'de>>(
let whole_body = hyper::body::aggregate(request.body_mut())
.await
.map_err(ApiError::from_err)?;
Ok(serde_json::from_reader(whole_body.reader())
.map_err(|err| ApiError::BadRequest(format!("Failed to parse json request {}", err)))?)
serde_json::from_reader(whole_body.reader())
.map_err(|err| ApiError::BadRequest(format!("Failed to parse json request {}", err)))
}
pub fn json_response<T: Serialize>(

View File

@@ -1,4 +1,4 @@
//! zenith_utils is intended to be a place to put code that is shared
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
#![allow(clippy::manual_range_contains)]
@@ -70,7 +70,7 @@ pub mod signals;
// So the build script will be run only when GIT_VERSION envvar has changed.
//
// Why not to use buildscript to get git commit sha directly without procmacro from different crate?
// Caching and workspaces complicates that. In case zenith_utils is not
// Caching and workspaces complicates that. In case `utils` is not
// recompiled due to caching then version may become outdated.
// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro,
// so if we changed the index state git_version will pick that up and rerun the macro.

View File

@@ -304,8 +304,8 @@ impl PostgresBackend {
pub fn start_tls(&mut self) -> anyhow::Result<()> {
match self.stream.take() {
Some(Stream::Bidirectional(bidi_stream)) => {
let session = rustls::ServerSession::new(&self.tls_config.clone().unwrap());
self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(session)?));
let conn = rustls::ServerConnection::new(self.tls_config.clone().unwrap())?;
self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(conn)?));
Ok(())
}
stream => {
@@ -375,9 +375,8 @@ impl PostgresBackend {
}
AuthType::MD5 => {
rand::thread_rng().fill(&mut self.md5_salt);
let md5_salt = self.md5_salt;
self.write_message(&BeMessage::AuthenticationMD5Password(
&md5_salt,
self.md5_salt,
))?;
self.state = ProtoState::Authentication;
}

View File

@@ -100,6 +100,21 @@ pub struct FeExecuteMessage {
#[derive(Debug)]
pub struct FeCloseMessage {}
/// Retry a read on EINTR
///
/// This runs the enclosed expression, and if it returns
/// Err(io::ErrorKind::Interrupted), retries it.
macro_rules! retry_read {
( $x:expr ) => {
loop {
match $x {
Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
res => break res,
}
}
};
}
impl FeMessage {
/// Read one message from the stream.
/// This function returns `Ok(None)` in case of EOF.
@@ -107,7 +122,7 @@ impl FeMessage {
///
/// ```
/// # use std::io;
/// # use zenith_utils::pq_proto::FeMessage;
/// # use utils::pq_proto::FeMessage;
/// #
/// # fn process_message(msg: FeMessage) -> anyhow::Result<()> {
/// # Ok(())
@@ -141,12 +156,12 @@ impl FeMessage {
// Each libpq message begins with a message type byte, followed by message length
// If the client closes the connection, return None. But if the client closes the
// connection in the middle of a message, we will return an error.
let tag = match stream.read_u8().await {
let tag = match retry_read!(stream.read_u8().await) {
Ok(b) => b,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
};
let len = stream.read_u32().await?;
let len = retry_read!(stream.read_u32().await)?;
// The message length includes itself, so it better be at least 4
let bodylen = len
@@ -207,7 +222,7 @@ impl FeStartupPacket {
// reading 4 bytes, to be precise), return None to indicate that the connection
// was closed. This matches the PostgreSQL server's behavior, which avoids noise
// in the log if the client opens connection but closes it immediately.
let len = match stream.read_u32().await {
let len = match retry_read!(stream.read_u32().await) {
Ok(len) => len as usize,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
@@ -217,7 +232,7 @@ impl FeStartupPacket {
bail!("invalid message length");
}
let request_code = stream.read_u32().await?;
let request_code = retry_read!(stream.read_u32().await)?;
// the rest of startup packet are params
let params_len = len - 8;
@@ -401,7 +416,8 @@ fn read_null_terminated(buf: &mut Bytes) -> anyhow::Result<Bytes> {
#[derive(Debug)]
pub enum BeMessage<'a> {
AuthenticationOk,
AuthenticationMD5Password(&'a [u8; 4]),
AuthenticationMD5Password([u8; 4]),
AuthenticationSasl(BeAuthenticationSaslMessage<'a>),
AuthenticationCleartextPassword,
BackendKeyData(CancelKeyData),
BindComplete,
@@ -429,6 +445,13 @@ pub enum BeMessage<'a> {
KeepAlive(WalSndKeepAlive),
}
#[derive(Debug)]
pub enum BeAuthenticationSaslMessage<'a> {
Methods(&'a [&'a str]),
Continue(&'a [u8]),
Final(&'a [u8]),
}
#[derive(Debug)]
pub enum BeParameterStatusMessage<'a> {
Encoding(&'a str),
@@ -611,6 +634,32 @@ impl<'a> BeMessage<'a> {
.unwrap(); // write into BytesMut can't fail
}
BeMessage::AuthenticationSasl(msg) => {
buf.put_u8(b'R');
write_body(buf, |buf| {
use BeAuthenticationSaslMessage::*;
match msg {
Methods(methods) => {
buf.put_i32(10); // Specifies that SASL auth method is used.
for method in methods.iter() {
write_cstr(method.as_bytes(), buf)?;
}
buf.put_u8(0); // zero terminator for the list
}
Continue(extra) => {
buf.put_i32(11); // Continue SASL auth.
buf.put_slice(extra);
}
Final(extra) => {
buf.put_i32(12); // Send final SASL message.
buf.put_slice(extra);
}
}
Ok::<_, io::Error>(())
})
.unwrap()
}
BeMessage::BackendKeyData(key_data) => {
buf.put_u8(b'K');
write_body(buf, |buf| {

View File

@@ -4,7 +4,7 @@ use std::{
sync::Arc,
};
use rustls::Session;
use rustls::Connection;
/// Wrapper supporting reads of a shared TcpStream.
pub struct ArcTcpRead(Arc<TcpStream>);
@@ -56,7 +56,7 @@ impl BufStream {
pub enum ReadStream {
Tcp(BufReader<ArcTcpRead>),
Tls(rustls_split::ReadHalf<rustls::ServerSession>),
Tls(rustls_split::ReadHalf),
}
impl io::Read for ReadStream {
@@ -79,7 +79,7 @@ impl ReadStream {
pub enum WriteStream {
Tcp(Arc<TcpStream>),
Tls(rustls_split::WriteHalf<rustls::ServerSession>),
Tls(rustls_split::WriteHalf),
}
impl WriteStream {
@@ -107,11 +107,11 @@ impl io::Write for WriteStream {
}
}
type TlsStream<T> = rustls::StreamOwned<rustls::ServerSession, T>;
type TlsStream<T> = rustls::StreamOwned<rustls::ServerConnection, T>;
pub enum BidiStream {
Tcp(BufStream),
/// This variant is boxed, because [`rustls::ServerSession`] is quite larger than [`BufStream`].
/// This variant is boxed, because [`rustls::ServerConnection`] is quite larger than [`BufStream`].
Tls(Box<TlsStream<BufStream>>),
}
@@ -127,7 +127,7 @@ impl BidiStream {
if how == Shutdown::Read {
tls_boxed.sock.get_ref().shutdown(how)
} else {
tls_boxed.sess.send_close_notify();
tls_boxed.conn.send_close_notify();
let res = tls_boxed.flush();
tls_boxed.sock.get_ref().shutdown(how)?;
res
@@ -154,19 +154,23 @@ impl BidiStream {
// TODO would be nice to avoid the Arc here
let socket = Arc::try_unwrap(reader.into_inner().0).unwrap();
let (read_half, write_half) =
rustls_split::split(socket, tls_boxed.sess, read_buf_cfg, write_buf_cfg);
let (read_half, write_half) = rustls_split::split(
socket,
Connection::Server(tls_boxed.conn),
read_buf_cfg,
write_buf_cfg,
);
(ReadStream::Tls(read_half), WriteStream::Tls(write_half))
}
}
}
pub fn start_tls(self, mut session: rustls::ServerSession) -> io::Result<Self> {
pub fn start_tls(self, mut conn: rustls::ServerConnection) -> io::Result<Self> {
match self {
Self::Tcp(mut stream) => {
session.complete_io(&mut stream)?;
assert!(!session.is_handshaking());
Ok(Self::Tls(Box::new(TlsStream::new(session, stream))))
conn.complete_io(&mut stream)?;
assert!(!conn.is_handshaking());
Ok(Self::Tls(Box::new(TlsStream::new(conn, stream))))
}
Self::Tls { .. } => Err(io::Error::new(
io::ErrorKind::InvalidInput,

View File

@@ -29,7 +29,7 @@ impl<S, T: Future> SyncFuture<S, T> {
/// Example:
///
/// ```
/// # use zenith_utils::sync::SyncFuture;
/// # use utils::sync::SyncFuture;
/// # use std::future::Future;
/// # use tokio::io::AsyncReadExt;
/// #

View File

@@ -2,7 +2,7 @@ use bytes::{Buf, BytesMut};
use hex_literal::hex;
use serde::Deserialize;
use std::io::Read;
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
#[derive(Debug, PartialEq, Deserialize)]
pub struct HeaderData {

View File

@@ -8,9 +8,8 @@ use std::{
use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use lazy_static::lazy_static;
use rustls::Session;
use zenith_utils::postgres_backend::{AuthType, Handler, PostgresBackend};
use utils::postgres_backend::{AuthType, Handler, PostgresBackend};
fn make_tcp_pair() -> (TcpStream, TcpStream) {
let listener = TcpListener::bind("127.0.0.1:0").unwrap();
@@ -23,11 +22,11 @@ fn make_tcp_pair() -> (TcpStream, TcpStream) {
lazy_static! {
static ref KEY: rustls::PrivateKey = {
let mut cursor = Cursor::new(include_bytes!("key.pem"));
rustls::internal::pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone()
rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
};
static ref CERT: rustls::Certificate = {
let mut cursor = Cursor::new(include_bytes!("cert.pem"));
rustls::internal::pemfile::certs(&mut cursor).unwrap()[0].clone()
rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
};
}
@@ -45,17 +44,23 @@ fn ssl() {
let ssl_response = client_sock.read_u8().unwrap();
assert_eq!(b'S', ssl_response);
let mut cfg = rustls::ClientConfig::new();
cfg.root_store.add(&CERT).unwrap();
let cfg = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&CERT).unwrap();
store
})
.with_no_client_auth();
let client_config = Arc::new(cfg);
let dns_name = webpki::DNSNameRef::try_from_ascii_str("localhost").unwrap();
let mut session = rustls::ClientSession::new(&client_config, dns_name);
let dns_name = "localhost".try_into().unwrap();
let mut conn = rustls::ClientConnection::new(client_config, dns_name).unwrap();
session.complete_io(&mut client_sock).unwrap();
assert!(!session.is_handshaking());
conn.complete_io(&mut client_sock).unwrap();
assert!(!conn.is_handshaking());
let mut stream = rustls::Stream::new(&mut session, &mut client_sock);
let mut stream = rustls::Stream::new(&mut conn, &mut client_sock);
// StartupMessage
stream.write_u32::<BigEndian>(9).unwrap();
@@ -105,8 +110,10 @@ fn ssl() {
}
let mut handler = TestHandler { got_query: false };
let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new());
cfg.set_single_cert(vec![CERT.clone()], KEY.clone())
let cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.unwrap();
let tls_config = Some(Arc::new(cfg));
@@ -209,8 +216,10 @@ fn server_forces_ssl() {
}
let mut handler = TestHandler;
let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new());
cfg.set_single_cert(vec![CERT.clone()], KEY.clone())
let cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.unwrap();
let tls_config = Some(Arc::new(cfg));

View File

@@ -3,24 +3,33 @@ name = "pageserver"
version = "0.1.0"
edition = "2021"
[features]
# It is simpler infra-wise to have failpoints enabled by default
# It shouldnt affect perf in any way because failpoints
# are not placed in hot code paths
default = ["failpoints"]
profiling = ["pprof"]
failpoints = ["fail/failpoints"]
[dependencies]
bookfile = { git = "https://github.com/zenithdb/bookfile.git", branch="generic-readext" }
chrono = "0.4.19"
rand = "0.8.3"
regex = "1.4.5"
bytes = { version = "1.0.1", features = ['serde'] }
byteorder = "1.4.3"
futures = "0.3.13"
hex = "0.4.3"
hyper = "0.14"
itertools = "0.10.3"
lazy_static = "1.4.0"
log = "0.4.14"
clap = "3.0"
daemonize = "0.4.1"
tokio = { version = "1.11", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-util = { version = "0.7", features = ["io"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
tokio-stream = "0.1.8"
anyhow = { version = "1.0", features = ["backtrace"] }
crc32c = "0.6.0"
@@ -30,13 +39,14 @@ humantime = "2.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "1.12.0"
humantime-serde = "1.1.1"
pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
toml_edit = { version = "0.13", features = ["easy"] }
scopeguard = "1.1.0"
async-trait = "0.1"
const_format = "0.2.21"
tracing = "0.1.27"
tracing-futures = "0.2"
signal-hook = "0.3.10"
url = "2"
nix = "0.23"
@@ -44,13 +54,16 @@ once_cell = "1.8.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] }
async-compression = {version = "0.3", features = ["zstd", "tokio"]}
rusoto_core = "0.47"
rusoto_s3 = "0.47"
async-trait = "0.1"
postgres_ffi = { path = "../postgres_ffi" }
zenith_metrics = { path = "../zenith_metrics" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
zstd = "0.11.1"
postgres_ffi = { path = "../libs/postgres_ffi" }
metrics = { path = "../libs/metrics" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
[dev-dependencies]
hex-literal = "0.3"

View File

@@ -13,7 +13,7 @@ keeps track of WAL records which are not synced to S3 yet.
The Page Server consists of multiple threads that operate on a shared
repository of page versions:
```
| WAL
V
+--------------+
@@ -46,7 +46,7 @@ Legend:
---> Data flow
<---
```
Page Service
------------

View File

@@ -10,28 +10,29 @@
//! This module is responsible for creation of such tarball
//! from data stored in object storage.
//!
use anyhow::{Context, Result};
use anyhow::{ensure, Context, Result};
use bytes::{BufMut, BytesMut};
use log::*;
use std::fmt::Write as FmtWrite;
use std::io;
use std::io::Write;
use std::sync::Arc;
use std::time::SystemTime;
use tar::{Builder, EntryType, Header};
use tracing::*;
use crate::relish::*;
use crate::reltag::SlruKind;
use crate::repository::Timeline;
use crate::DatadirTimelineImpl;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::*;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
/// This is short-living object only for the time of tarball creation,
/// created mostly to avoid passing a lot of parameters between various functions
/// used for constructing tarball.
pub struct Basebackup<'a> {
ar: Builder<&'a mut dyn Write>,
timeline: &'a Arc<dyn Timeline>,
timeline: &'a Arc<DatadirTimelineImpl>,
pub lsn: Lsn,
prev_record_lsn: Lsn,
}
@@ -46,7 +47,7 @@ pub struct Basebackup<'a> {
impl<'a> Basebackup<'a> {
pub fn new(
write: &'a mut dyn Write,
timeline: &'a Arc<dyn Timeline>,
timeline: &'a Arc<DatadirTimelineImpl>,
req_lsn: Option<Lsn>,
) -> Result<Basebackup<'a>> {
// Compute postgres doesn't have any previous WAL files, but the first
@@ -64,13 +65,14 @@ impl<'a> Basebackup<'a> {
// prev_lsn to Lsn(0) if we cannot provide the correct value.
let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn {
// Backup was requested at a particular LSN. Wait for it to arrive.
timeline.wait_lsn(req_lsn)?;
info!("waiting for {}", req_lsn);
timeline.tline.wait_lsn(req_lsn)?;
// If the requested point is the end of the timeline, we can
// provide prev_lsn. (get_last_record_rlsn() might return it as
// zero, though, if no WAL has been generated on this timeline
// yet.)
let end_of_timeline = timeline.get_last_record_rlsn();
let end_of_timeline = timeline.tline.get_last_record_rlsn();
if req_lsn == end_of_timeline.last {
(end_of_timeline.prev, req_lsn)
} else {
@@ -78,7 +80,7 @@ impl<'a> Basebackup<'a> {
}
} else {
// Backup was requested at end of the timeline.
let end_of_timeline = timeline.get_last_record_rlsn();
let end_of_timeline = timeline.tline.get_last_record_rlsn();
(end_of_timeline.prev, end_of_timeline.last)
};
@@ -115,21 +117,24 @@ impl<'a> Basebackup<'a> {
}
// Gather non-relational files from object storage pages.
for obj in self.timeline.list_nonrels(self.lsn)? {
match obj {
RelishTag::Slru { slru, segno } => {
self.add_slru_segment(slru, segno)?;
}
RelishTag::FileNodeMap { spcnode, dbnode } => {
self.add_relmap_file(spcnode, dbnode)?;
}
RelishTag::TwoPhase { xid } => {
self.add_twophase_file(xid)?;
}
_ => {}
for kind in [
SlruKind::Clog,
SlruKind::MultiXactOffsets,
SlruKind::MultiXactMembers,
] {
for segno in self.timeline.list_slru_segments(kind, self.lsn)? {
self.add_slru_segment(kind, segno)?;
}
}
// Create tablespace directories
for ((spcnode, dbnode), has_relmap_file) in self.timeline.list_dbdirs(self.lsn)? {
self.add_dbdir(spcnode, dbnode, has_relmap_file)?;
}
for xid in self.timeline.list_twophase_files(self.lsn)? {
self.add_twophase_file(xid)?;
}
// Generate pg_control and bootstrap WAL segment.
self.add_pgcontrol_file()?;
self.ar.finish()?;
@@ -141,28 +146,15 @@ impl<'a> Basebackup<'a> {
// Generate SLRU segment files from repository.
//
fn add_slru_segment(&mut self, slru: SlruKind, segno: u32) -> anyhow::Result<()> {
let seg_size = self
.timeline
.get_relish_size(RelishTag::Slru { slru, segno }, self.lsn)?;
if seg_size == None {
trace!(
"SLRU segment {}/{:>04X} was truncated",
slru.to_str(),
segno
);
return Ok(());
}
let nblocks = seg_size.unwrap();
let nblocks = self.timeline.get_slru_segment_size(slru, segno, self.lsn)?;
let mut slru_buf: Vec<u8> =
Vec::with_capacity(nblocks as usize * pg_constants::BLCKSZ as usize);
for blknum in 0..nblocks {
let img =
self.timeline
.get_page_at_lsn(RelishTag::Slru { slru, segno }, blknum, self.lsn)?;
assert!(img.len() == pg_constants::BLCKSZ as usize);
let img = self
.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn)?;
ensure!(img.len() == pg_constants::BLCKSZ as usize);
slru_buf.extend_from_slice(&img);
}
@@ -176,16 +168,26 @@ impl<'a> Basebackup<'a> {
}
//
// Extract pg_filenode.map files from repository
// Along with them also send PG_VERSION for each database.
// Include database/tablespace directories.
//
fn add_relmap_file(&mut self, spcnode: u32, dbnode: u32) -> anyhow::Result<()> {
let img = self.timeline.get_page_at_lsn(
RelishTag::FileNodeMap { spcnode, dbnode },
0,
self.lsn,
)?;
let path = if spcnode == pg_constants::GLOBALTABLESPACE_OID {
// Each directory contains a PG_VERSION file, and the default database
// directories also contain pg_filenode.map files.
//
fn add_dbdir(
&mut self,
spcnode: u32,
dbnode: u32,
has_relmap_file: bool,
) -> anyhow::Result<()> {
let relmap_img = if has_relmap_file {
let img = self.timeline.get_relmap_file(spcnode, dbnode, self.lsn)?;
ensure!(img.len() == 512);
Some(img)
} else {
None
};
if spcnode == pg_constants::GLOBALTABLESPACE_OID {
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header("PG_VERSION", version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
@@ -193,26 +195,51 @@ impl<'a> Basebackup<'a> {
let header = new_tar_header("global/PG_VERSION", version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
String::from("global/pg_filenode.map") // filenode map for global tablespace
if let Some(img) = relmap_img {
// filenode map for global tablespace
let header = new_tar_header("global/pg_filenode.map", img.len() as u64)?;
self.ar.append(&header, &img[..])?;
} else {
warn!("global/pg_filenode.map is missing");
}
} else {
// User defined tablespaces are not supported. However, as
// a special case, if a tablespace/db directory is
// completely empty, we can leave it out altogether. This
// makes taking a base backup after the 'tablespace'
// regression test pass, because the test drops the
// created tablespaces after the tests.
//
// FIXME: this wouldn't be necessary, if we handled
// XLOG_TBLSPC_DROP records. But we probably should just
// throw an error on CREATE TABLESPACE in the first place.
if !has_relmap_file
&& self
.timeline
.list_rels(spcnode, dbnode, self.lsn)?
.is_empty()
{
return Ok(());
}
// User defined tablespaces are not supported
assert!(spcnode == pg_constants::DEFAULTTABLESPACE_OID);
ensure!(spcnode == pg_constants::DEFAULTTABLESPACE_OID);
// Append dir path for each database
let path = format!("base/{}", dbnode);
let header = new_tar_header_dir(&path)?;
self.ar.append(&header, &mut io::empty())?;
let dst_path = format!("base/{}/PG_VERSION", dbnode);
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header(&dst_path, version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
if let Some(img) = relmap_img {
let dst_path = format!("base/{}/PG_VERSION", dbnode);
let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes();
let header = new_tar_header(&dst_path, version_bytes.len() as u64)?;
self.ar.append(&header, version_bytes)?;
format!("base/{}/pg_filenode.map", dbnode)
let relmap_path = format!("base/{}/pg_filenode.map", dbnode);
let header = new_tar_header(&relmap_path, img.len() as u64)?;
self.ar.append(&header, &img[..])?;
}
};
assert!(img.len() == 512);
let header = new_tar_header(&path, img.len() as u64)?;
self.ar.append(&header, &img[..])?;
Ok(())
}
@@ -220,9 +247,7 @@ impl<'a> Basebackup<'a> {
// Extract twophase state files
//
fn add_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
let img = self
.timeline
.get_page_at_lsn(RelishTag::TwoPhase { xid }, 0, self.lsn)?;
let img = self.timeline.get_twophase_file(xid, self.lsn)?;
let mut buf = BytesMut::new();
buf.extend_from_slice(&img[..]);
@@ -242,11 +267,11 @@ impl<'a> Basebackup<'a> {
fn add_pgcontrol_file(&mut self) -> anyhow::Result<()> {
let checkpoint_bytes = self
.timeline
.get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn)
.get_checkpoint(self.lsn)
.context("failed to get checkpoint bytes")?;
let pg_control_bytes = self
.timeline
.get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn)
.get_control_file(self.lsn)
.context("failed get control bytes")?;
let mut pg_control = ControlFileData::decode(&pg_control_bytes)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
@@ -267,7 +292,7 @@ impl<'a> Basebackup<'a> {
// add zenith.signal file
let mut zenith_signal = String::new();
if self.prev_record_lsn == Lsn(0) {
if self.lsn == self.timeline.get_ancestor_lsn() {
if self.lsn == self.timeline.tline.get_ancestor_lsn() {
write!(zenith_signal, "PREV LSN: none")?;
} else {
write!(zenith_signal, "PREV LSN: invalid")?;
@@ -291,7 +316,7 @@ impl<'a> Basebackup<'a> {
let wal_file_path = format!("pg_wal/{}", wal_file_name);
let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?;
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier);
assert!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
ensure!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
self.ar.append(&header, &wal_seg[..])?;
Ok(())
}

View File

@@ -4,9 +4,10 @@
use anyhow::Result;
use clap::{App, Arg};
use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::page_cache;
use pageserver::virtual_file;
use std::path::PathBuf;
use zenith_utils::GIT_VERSION;
use utils::GIT_VERSION;
fn main() -> Result<()> {
let arg_matches = App::new("Zenith dump_layerfile utility")
@@ -24,8 +25,9 @@ fn main() -> Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(10);
page_cache::init(100);
dump_layerfile_from_path(&path)?;
dump_layerfile_from_path(&path, true)?;
Ok(())
}

View File

@@ -2,14 +2,6 @@
use std::{env, path::Path, str::FromStr};
use tracing::*;
use zenith_utils::{
auth::JwtAuth,
logging,
postgres_backend::AuthType,
tcp_listener,
zid::{ZTenantId, ZTimelineId},
GIT_VERSION,
};
use anyhow::{bail, Context, Result};
@@ -18,23 +10,36 @@ use daemonize::Daemonize;
use pageserver::{
config::{defaults::*, PageServerConf},
http, page_cache, page_service,
remote_storage::{self, SyncStartupData},
repository::TimelineSyncStatusUpdate,
tenant_mgr, thread_mgr,
http, page_cache, page_service, profiling, tenant_mgr, thread_mgr,
thread_mgr::ThreadKind,
timelines, virtual_file, LOG_FILE_NAME,
};
use zenith_utils::http::endpoint;
use zenith_utils::postgres_backend;
use zenith_utils::shutdown::exit_now;
use zenith_utils::signals::{self, Signal};
use utils::{
auth::JwtAuth,
http::endpoint,
logging,
postgres_backend::AuthType,
shutdown::exit_now,
signals::{self, Signal},
tcp_listener,
zid::{ZTenantId, ZTimelineId},
GIT_VERSION,
};
fn main() -> Result<()> {
zenith_metrics::set_common_metrics_prefix("pageserver");
fn version() -> String {
format!(
"{} profiling:{} failpoints:{}",
GIT_VERSION,
cfg!(feature = "profiling"),
fail::has_failpoints()
)
}
fn main() -> anyhow::Result<()> {
metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(GIT_VERSION)
.version(&*version())
.arg(
Arg::new("daemonize")
.short('d')
@@ -116,7 +121,7 @@ fn main() -> Result<()> {
// We're initializing the repo, so there's no config file yet
DEFAULT_CONFIG_FILE
.parse::<toml_edit::Document>()
.expect("could not parse built-in config file")
.context("could not parse built-in config file")?
} else {
// Supplement the CLI arguments with the config file
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)
@@ -163,8 +168,7 @@ fn main() -> Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
page_cache::init(conf);
page_cache::init(conf.page_cache_size);
// Create repo and exit if init was requested
if init {
@@ -210,7 +214,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
// There shouldn't be any logging to stdin/stdout. Redirect it to the main log so
// that we will see any accidental manual fprintf's or backtraces.
let stdout = log_file.try_clone().unwrap();
let stdout = log_file
.try_clone()
.with_context(|| format!("Failed to clone log file '{:?}'", log_file))?;
let stderr = log_file;
let daemonize = Daemonize::new()
@@ -231,46 +237,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
let signals = signals::install_shutdown_handlers()?;
// Initialize repositories with locally available timelines.
// Timelines that are only partially available locally (remote storage has more data than this pageserver)
// are scheduled for download and added to the repository once download is completed.
let SyncStartupData {
remote_index,
local_timeline_init_statuses,
} = remote_storage::start_local_timeline_sync(conf)
.context("Failed to set up local files sync with external storage")?;
for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses {
// initialize local tenant
let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index);
for (timeline_id, init_status) in local_timeline_init_statuses {
match init_status {
remote_storage::LocalTimelineInitStatus::LocallyComplete => {
debug!("timeline {} for tenant {} is locally complete, registering it in repository", tenant_id, timeline_id);
// Lets fail here loudly to be on the safe side.
// XXX: It may be a better api to actually distinguish between repository startup
// and processing of newly downloaded timelines.
repo.apply_timeline_remote_sync_status_update(
timeline_id,
TimelineSyncStatusUpdate::Downloaded,
)
.with_context(|| {
format!(
"Failed to bootstrap timeline {} for tenant {}",
timeline_id, tenant_id
)
})?
}
remote_storage::LocalTimelineInitStatus::NeedsSync => {
debug!(
"timeline {} for tenant {} needs sync, \
so skipped for adding into repository until sync is finished",
tenant_id, timeline_id
);
}
}
}
}
// start profiler (if enabled)
let profiler_guard = profiling::init_profiler(conf);
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
@@ -283,6 +251,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
};
info!("Using auth: {:#?}", conf.auth_type);
let remote_index = tenant_mgr::init_tenant_mgr(conf)?;
// Spawn a new thread for the http endpoint
// bind before launching the separate thread so that any bind error is reported before startup exits
let auth_cloned = auth.clone();
@@ -291,8 +261,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"http_endpoint_thread",
false,
move || {
let router = http::make_router(conf, auth_cloned, remote_index);
let router = http::make_router(conf, auth_cloned, remote_index)?;
endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher())
},
)?;
@@ -304,6 +275,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"libpq endpoint thread",
false,
move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type),
)?;
@@ -313,6 +285,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"Got {}. Terminating in immediate shutdown mode",
signal.name()
);
profiling::exit_profiler(conf, &profiler_guard);
std::process::exit(111);
}
@@ -321,38 +294,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"Got {}. Terminating gracefully in fast shutdown mode",
signal.name()
);
shutdown_pageserver();
profiling::exit_profiler(conf, &profiler_guard);
pageserver::shutdown_pageserver();
unreachable!()
}
})
}
fn shutdown_pageserver() {
// Shut down the libpq endpoint thread. This prevents new connections from
// being accepted.
thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None);
// Shut down any page service threads.
postgres_backend::set_pgbackend_shutdown_requested();
thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None);
// Shut down all the tenants. This flushes everything to disk and kills
// the checkpoint and GC threads.
tenant_mgr::shutdown_all_tenants();
// Stop syncing with remote storage.
//
// FIXME: Does this wait for the sync thread to finish syncing what's queued up?
// Should it?
thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None);
// Shut down the HTTP endpoint last, so that you can still check the server's
// status while it's shutting down.
thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None);
// There should be nothing left, but let's be sure
thread_mgr::shutdown_threads(None, None, None);
info!("Shut down successfully completed");
std::process::exit(0);
}
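The new version() helper above appends build-time feature information to the version string shown by the CLI. A self-contained sketch of the same formatting, with placeholders standing in for the generated GIT_VERSION constant and for fail::has_failpoints():

// Placeholder for the build-time GIT_VERSION constant.
const GIT_VERSION: &str = "local-build";

fn version() -> String {
    // Same layout as above: "<git version> profiling:<bool> failpoints:<bool>"
    format!(
        "{} profiling:{} failpoints:{}",
        GIT_VERSION,
        cfg!(feature = "profiling"),
        false // stands in for fail::has_failpoints()
    )
}

fn main() {
    // Prints e.g. "local-build profiling:false failpoints:false"
    println!("{}", version());
}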

View File

@@ -1,334 +0,0 @@
//! A CLI helper to deal with remote storage (S3, usually) blobs as archives.
//! See [`compression`] for more details about the archives.
use std::{collections::BTreeSet, path::Path};
use anyhow::{bail, ensure, Context};
use clap::{App, Arg};
use pageserver::{
layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME},
remote_storage::compression,
};
use tokio::{fs, io};
use zenith_utils::GIT_VERSION;
const LIST_SUBCOMMAND: &str = "list";
const ARCHIVE_ARG_NAME: &str = "archive";
const EXTRACT_SUBCOMMAND: &str = "extract";
const TARGET_DIRECTORY_ARG_NAME: &str = "target_directory";
const CREATE_SUBCOMMAND: &str = "create";
const SOURCE_DIRECTORY_ARG_NAME: &str = "source_directory";
#[tokio::main(flavor = "current_thread")]
async fn main() -> anyhow::Result<()> {
let arg_matches = App::new("pageserver zst blob [un]compressor utility")
.version(GIT_VERSION)
.subcommands(vec![
App::new(LIST_SUBCOMMAND)
.about("List the archive contents")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to list the contents of"),
),
App::new(EXTRACT_SUBCOMMAND)
.about("Extracts the archive into the directory")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to extract"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to extract the archive into. Optional, will use the current directory if not specified"),
),
App::new(CREATE_SUBCOMMAND)
.about("Creates an archive with the contents of a directory (only the first level files are taken, metadata file has to be present in the same directory)")
.arg(
Arg::new(SOURCE_DIRECTORY_ARG_NAME)
.required(true)
.takes_value(true)
.help("A directory to use for creating the archive"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to create the archive in. Optional, will use the current directory if not specified"),
),
])
.get_matches();
let subcommand_name = match arg_matches.subcommand_name() {
Some(name) => name,
None => bail!("No subcommand specified"),
};
let subcommand_matches = match arg_matches.subcommand_matches(subcommand_name) {
Some(matches) => matches,
None => bail!(
"No subcommand arguments were recognized for subcommand '{}'",
subcommand_name
),
};
let target_dir = Path::new(
subcommand_matches
.value_of(TARGET_DIRECTORY_ARG_NAME)
.unwrap_or("./"),
);
match subcommand_name {
LIST_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
list_archive(archive).await
}
EXTRACT_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
extract_archive(archive, target_dir).await
}
CREATE_SUBCOMMAND => {
let source_dir = match subcommand_matches.value_of(SOURCE_DIRECTORY_ARG_NAME) {
Some(source) => Path::new(source),
None => bail!("No '{}' argument is specified", SOURCE_DIRECTORY_ARG_NAME),
};
create_archive(source_dir, target_dir).await
}
unknown => bail!("Unknown subcommand {}", unknown),
}
}
async fn list_archive(archive: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
println!("Listing an archive at path '{}'", archive.display());
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
let archive_bytes = fs::read(&archive)
.await
.context("Failed to read the archive bytes")?;
let header = compression::read_archive_header(archive_name, &mut archive_bytes.as_slice())
.await
.context("Failed to read the archive header")?;
let empty_path = Path::new("");
println!("-------------------------------");
let longest_path_in_archive = header
.files
.iter()
.filter_map(|file| Some(file.subpath.as_path(empty_path).to_str()?.len()))
.max()
.unwrap_or_default()
.max(METADATA_FILE_NAME.len());
for regular_file in &header.files {
println!(
"File: {:width$} uncompressed size: {} bytes",
regular_file.subpath.as_path(empty_path).display(),
regular_file.size,
width = longest_path_in_archive,
)
}
println!(
"File: {:width$} uncompressed size: {} bytes",
METADATA_FILE_NAME,
header.metadata_file_size,
width = longest_path_in_archive,
);
println!("-------------------------------");
Ok(())
}
async fn extract_archive(archive: &Path, target_dir: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
let mut dir_contents = fs::read_dir(&target_dir)
.await
.context("Failed to list the target directory contents")?;
let dir_entry = dir_contents
.next_entry()
.await
.context("Failed to list the target directory contents")?;
ensure!(
dir_entry.is_none(),
"Target directory '{}' is not empty",
target_dir.display()
);
println!(
"Extracting an archive at path '{}' into directory '{}'",
archive.display(),
target_dir.display()
);
let mut archive_file = fs::File::open(&archive).await.with_context(|| {
format!(
"Failed to get the archive name from the path '{}'",
archive.display()
)
})?;
let header = compression::read_archive_header(archive_name, &mut archive_file)
.await
.context("Failed to read the archive header")?;
compression::uncompress_with_header(&BTreeSet::new(), &target_dir, header, &mut archive_file)
.await
.context("Failed to extract the archive")
}
async fn create_archive(source_dir: &Path, target_dir: &Path) -> anyhow::Result<()> {
let source_dir = source_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the source dir path '{}'",
source_dir.display()
)
})?;
ensure!(
source_dir.is_dir(),
"Path '{}' is not a directory",
source_dir.display()
);
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
println!(
"Compressing directory '{}' and creating resulting archive in directory '{}'",
source_dir.display(),
target_dir.display()
);
let mut metadata_file_contents = None;
let mut files_co_archive = Vec::new();
let mut source_dir_contents = fs::read_dir(&source_dir)
.await
.context("Failed to read the source directory contents")?;
while let Some(source_dir_entry) = source_dir_contents
.next_entry()
.await
.context("Failed to read a source dir entry")?
{
let entry_path = source_dir_entry.path();
if entry_path.is_file() {
if entry_path.file_name().and_then(|name| name.to_str()) == Some(METADATA_FILE_NAME) {
let metadata_bytes = fs::read(entry_path)
.await
.context("Failed to read metata file bytes in the source dir")?;
metadata_file_contents = Some(
TimelineMetadata::from_bytes(&metadata_bytes)
.context("Failed to parse metata file contents in the source dir")?,
);
} else {
files_co_archive.push(entry_path);
}
}
}
let metadata = match metadata_file_contents {
Some(metadata) => metadata,
None => bail!(
"No metadata file found in the source dir '{}', cannot create the archive",
source_dir.display()
),
};
let _ = compression::archive_files_as_stream(
&source_dir,
files_co_archive.iter(),
&metadata,
move |mut archive_streamer, archive_name| async move {
let archive_target = target_dir.join(&archive_name);
let mut archive_file = fs::File::create(&archive_target).await?;
io::copy(&mut archive_streamer, &mut archive_file).await?;
Ok(archive_target)
},
)
.await
.context("Failed to create an archive")?;
Ok(())
}

View File

@@ -6,8 +6,7 @@ use clap::{App, Arg};
use pageserver::layered_repository::metadata::TimelineMetadata;
use std::path::PathBuf;
use std::str::FromStr;
use zenith_utils::lsn::Lsn;
use zenith_utils::GIT_VERSION;
use utils::{lsn::Lsn, GIT_VERSION};
fn main() -> Result<()> {
let arg_matches = App::new("Zenith update metadata utility")

View File

@@ -4,22 +4,30 @@
//! file, or on the command line.
//! See also `settings.md` for a more detailed description of every parameter.
use anyhow::{bail, ensure, Context, Result};
use toml_edit;
use toml_edit::{Document, Item};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
use std::convert::TryInto;
use anyhow::{anyhow, bail, ensure, Context, Result};
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::{Path, PathBuf};
use std::str::FromStr;
use std::time::Duration;
use toml_edit;
use toml_edit::{Document, Item};
use utils::{
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
use crate::layered_repository::TIMELINES_SEGMENT_NAME;
use crate::tenant_config::{TenantConf, TenantConfOpt};
pub const ZSTD_MAX_SAMPLES: usize = 1024;
pub const ZSTD_MIN_SAMPLES: usize = 8; // zstd requires at least this many samples to train a dictionary
pub const ZSTD_MAX_DICTIONARY_SIZE: usize = 64 * 1024;
pub const ZSTD_COMPRESSION_LEVEL: i32 = 0; // default compression level
pub const ZSTD_DECOMPRESS_BUFFER_LIMIT: usize = 64 * 1024; // TODO: handle larger WAL records?
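The constants above appear to parameterize dictionary-based zstd compression of layer contents. A hedged sketch of how they could be wired to the zstd crate's dictionary and bulk APIs; the helper name, the sample collection, and the actual call sites in the pageserver are assumptions, not code from this diff:

use anyhow::Result;

// Hypothetical helper: train a dictionary from collected samples and
// round-trip one record through it, honoring the limits defined above.
fn roundtrip_with_dictionary(samples: &[Vec<u8>], record: &[u8]) -> Result<Vec<u8>> {
    assert!(samples.len() >= ZSTD_MIN_SAMPLES && samples.len() <= ZSTD_MAX_SAMPLES);
    let dict = zstd::dict::from_samples(samples, ZSTD_MAX_DICTIONARY_SIZE)?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(ZSTD_COMPRESSION_LEVEL, &dict)?;
    let compressed = compressor.compress(record)?;
    // The read path reuses the same dictionary; output size is capped by the buffer limit.
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;
    let decompressed = decompressor.decompress(&compressed, ZSTD_DECOMPRESS_BUFFER_LIMIT)?;
    assert_eq!(decompressed, record);
    Ok(compressed)
}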
pub mod defaults {
use crate::tenant_config::defaults::*;
use const_format::formatcp;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
@@ -27,21 +35,22 @@ pub mod defaults {
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}");
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
pub const DEFAULT_CHECKPOINT_PERIOD: &str = "1 s";
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: &str = "100 s";
pub const DEFAULT_WAIT_LSN_TIMEOUT: &str = "60 s";
pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 100;
/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
/// Both cases may trigger a timeline download, which might fetch a lot of layers. This concurrency is limited internally by the clients, if needed.
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC: usize = 50;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
/// Currently, sync happens with AWS S3, that has two limits on requests per second:
/// ~200 RPS for IAM services
/// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html
/// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests
/// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
@@ -56,12 +65,6 @@ pub mod defaults {
#listen_pg_addr = '{DEFAULT_PG_LISTEN_ADDR}'
#listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#checkpoint_period = '{DEFAULT_CHECKPOINT_PERIOD}'
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#wait_lsn_timeout = '{DEFAULT_WAIT_LSN_TIMEOUT}'
#wal_redo_timeout = '{DEFAULT_WAL_REDO_TIMEOUT}'
@@ -70,6 +73,17 @@ pub mod defaults {
# initial superuser role name to use when creating a new tenant
#initial_superuser_name = '{DEFAULT_SUPERUSER}'
# [tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
#compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}'
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#image_creation_threshold = {DEFAULT_IMAGE_CREATION_THRESHOLD}
#pitr_interval = '{DEFAULT_PITR_INTERVAL}'
# [remote_storage]
"###
@@ -87,15 +101,6 @@ pub struct PageServerConf {
/// Example (default): 127.0.0.1:9898
pub listen_http_addr: String,
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
pub checkpoint_distance: u64,
pub checkpoint_period: Duration,
pub gc_horizon: u64,
pub gc_period: Duration,
// Timeout when waiting for WAL receiver to catch up to an LSN given in a GetPage@LSN call.
pub wait_lsn_timeout: Duration,
// How long to wait for WAL redo to complete.
@@ -120,6 +125,28 @@ pub struct PageServerConf {
pub auth_validation_public_key_path: Option<PathBuf>,
pub remote_storage_config: Option<RemoteStorageConfig>,
pub profiling: ProfilingConfig,
pub default_tenant_conf: TenantConf,
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProfilingConfig {
Disabled,
PageRequests,
}
impl FromStr for ProfilingConfig {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<ProfilingConfig, Self::Err> {
let result = match s {
"disabled" => ProfilingConfig::Disabled,
"page_requests" => ProfilingConfig::PageRequests,
_ => bail!("invalid value \"{s}\" for profiling option, valid values are \"disabled\" and \"page_requests\""),
};
Ok(result)
}
}
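Because ProfilingConfig implements FromStr with anyhow::Error, the setting can be parsed straight from the string kept in the config file. A small, hypothetical round-trip through the impl above:

// Hypothetical usage of the FromStr impl above.
fn profiling_from_str_examples() -> anyhow::Result<()> {
    assert_eq!("disabled".parse::<ProfilingConfig>()?, ProfilingConfig::Disabled);
    assert_eq!(
        "page_requests".parse::<ProfilingConfig>()?,
        ProfilingConfig::PageRequests
    );
    // Any other value is rejected with the descriptive error from the impl.
    assert!("sometimes".parse::<ProfilingConfig>().is_err());
    Ok(())
}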
// use dedicated enum for builder to better indicate the intention
@@ -144,12 +171,6 @@ struct PageServerConfigBuilder {
listen_http_addr: BuilderValue<String>,
checkpoint_distance: BuilderValue<u64>,
checkpoint_period: BuilderValue<Duration>,
gc_horizon: BuilderValue<u64>,
gc_period: BuilderValue<Duration>,
wait_lsn_timeout: BuilderValue<Duration>,
wal_redo_timeout: BuilderValue<Duration>,
@@ -169,6 +190,8 @@ struct PageServerConfigBuilder {
remote_storage_config: BuilderValue<Option<RemoteStorageConfig>>,
id: BuilderValue<ZNodeId>,
profiling: BuilderValue<ProfilingConfig>,
}
impl Default for PageServerConfigBuilder {
@@ -178,12 +201,6 @@ impl Default for PageServerConfigBuilder {
Self {
listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()),
listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()),
checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE),
checkpoint_period: Set(humantime::parse_duration(DEFAULT_CHECKPOINT_PERIOD)
.expect("cannot parse default checkpoint period")),
gc_horizon: Set(DEFAULT_GC_HORIZON),
gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD)
.expect("cannot parse default gc period")),
wait_lsn_timeout: Set(humantime::parse_duration(DEFAULT_WAIT_LSN_TIMEOUT)
.expect("cannot parse default wait lsn timeout")),
wal_redo_timeout: Set(humantime::parse_duration(DEFAULT_WAL_REDO_TIMEOUT)
@@ -199,6 +216,7 @@ impl Default for PageServerConfigBuilder {
auth_validation_public_key_path: Set(None),
remote_storage_config: Set(None),
id: NotSet,
profiling: Set(ProfilingConfig::Disabled),
}
}
}
@@ -212,22 +230,6 @@ impl PageServerConfigBuilder {
self.listen_http_addr = BuilderValue::Set(listen_http_addr)
}
pub fn checkpoint_distance(&mut self, checkpoint_distance: u64) {
self.checkpoint_distance = BuilderValue::Set(checkpoint_distance)
}
pub fn checkpoint_period(&mut self, checkpoint_period: Duration) {
self.checkpoint_period = BuilderValue::Set(checkpoint_period)
}
pub fn gc_horizon(&mut self, gc_horizon: u64) {
self.gc_horizon = BuilderValue::Set(gc_horizon)
}
pub fn gc_period(&mut self, gc_period: Duration) {
self.gc_period = BuilderValue::Set(gc_period)
}
pub fn wait_lsn_timeout(&mut self, wait_lsn_timeout: Duration) {
self.wait_lsn_timeout = BuilderValue::Set(wait_lsn_timeout)
}
@@ -275,49 +277,46 @@ impl PageServerConfigBuilder {
self.id = BuilderValue::Set(node_id)
}
pub fn profiling(&mut self, profiling: ProfilingConfig) {
self.profiling = BuilderValue::Set(profiling)
}
pub fn build(self) -> Result<PageServerConf> {
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
.ok_or(anyhow::anyhow!("missing listen_pg_addr"))?,
.ok_or(anyhow!("missing listen_pg_addr"))?,
listen_http_addr: self
.listen_http_addr
.ok_or(anyhow::anyhow!("missing listen_http_addr"))?,
checkpoint_distance: self
.checkpoint_distance
.ok_or(anyhow::anyhow!("missing checkpoint_distance"))?,
checkpoint_period: self
.checkpoint_period
.ok_or(anyhow::anyhow!("missing checkpoint_period"))?,
gc_horizon: self
.gc_horizon
.ok_or(anyhow::anyhow!("missing gc_horizon"))?,
gc_period: self.gc_period.ok_or(anyhow::anyhow!("missing gc_period"))?,
.ok_or(anyhow!("missing listen_http_addr"))?,
wait_lsn_timeout: self
.wait_lsn_timeout
.ok_or(anyhow::anyhow!("missing wait_lsn_timeout"))?,
.ok_or(anyhow!("missing wait_lsn_timeout"))?,
wal_redo_timeout: self
.wal_redo_timeout
.ok_or(anyhow::anyhow!("missing wal_redo_timeout"))?,
superuser: self.superuser.ok_or(anyhow::anyhow!("missing superuser"))?,
.ok_or(anyhow!("missing wal_redo_timeout"))?,
superuser: self.superuser.ok_or(anyhow!("missing superuser"))?,
page_cache_size: self
.page_cache_size
.ok_or(anyhow::anyhow!("missing page_cache_size"))?,
.ok_or(anyhow!("missing page_cache_size"))?,
max_file_descriptors: self
.max_file_descriptors
.ok_or(anyhow::anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow::anyhow!("missing workdir"))?,
.ok_or(anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow!("missing workdir"))?,
pg_distrib_dir: self
.pg_distrib_dir
.ok_or(anyhow::anyhow!("missing pg_distrib_dir"))?,
auth_type: self.auth_type.ok_or(anyhow::anyhow!("missing auth_type"))?,
.ok_or(anyhow!("missing pg_distrib_dir"))?,
auth_type: self.auth_type.ok_or(anyhow!("missing auth_type"))?,
auth_validation_public_key_path: self
.auth_validation_public_key_path
.ok_or(anyhow::anyhow!("missing auth_validation_public_key_path"))?,
.ok_or(anyhow!("missing auth_validation_public_key_path"))?,
remote_storage_config: self
.remote_storage_config
.ok_or(anyhow::anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow::anyhow!("missing id"))?,
.ok_or(anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow!("missing id"))?,
profiling: self.profiling.ok_or(anyhow!("missing profiling"))?,
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
})
}
}
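The builder above keeps every field as either Set(value) or NotSet, and build() turns missing required fields into errors via ok_or. The BuilderValue type itself is not shown in this diff, so the following is only a plausible sketch of such a helper:

// Assumed shape of BuilderValue; the real definition may differ.
#[derive(Debug, Clone)]
enum BuilderValue<T> {
    Set(T),
    NotSet,
}

impl<T> BuilderValue<T> {
    // Mirrors Option::ok_or: yield the value, or the supplied error if unset.
    fn ok_or<E>(self, err: E) -> Result<T, E> {
        match self {
            BuilderValue::Set(value) => Ok(value),
            BuilderValue::NotSet => Err(err),
        }
    }
}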
@@ -326,7 +325,7 @@ impl PageServerConfigBuilder {
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between pageserver and the remote storage.
pub max_concurrent_sync: NonZeroUsize,
pub max_concurrent_timelines_sync: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
@@ -337,10 +336,10 @@ pub struct RemoteStorageConfig {
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored relish data into.
/// Specify a root folder to place all stored files into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all relishes into the root
/// of the S3 bucket from the config.
/// AWS S3 based storage, storing all files in the S3 bucket
/// specified by the config
AwsS3(S3Config),
}
@@ -367,6 +366,9 @@ pub struct S3Config {
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,
/// AWS S3 has various limits on its API calls; we must not exceed them.
/// See [`defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details.
pub concurrency_limit: NonZeroUsize,
}
impl std::fmt::Debug for S3Config {
@@ -375,6 +377,7 @@ impl std::fmt::Debug for S3Config {
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.field("prefix_in_bucket", &self.prefix_in_bucket)
.field("concurrency_limit", &self.concurrency_limit)
.finish()
}
}
@@ -420,14 +423,12 @@ impl PageServerConf {
let mut builder = PageServerConfigBuilder::default();
builder.workdir(workdir.to_owned());
let mut t_conf: TenantConfOpt = Default::default();
for (key, item) in toml.iter() {
match key {
"listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?),
"listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?),
"checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?),
"checkpoint_period" => builder.checkpoint_period(parse_toml_duration(key, item)?),
"gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?),
"gc_period" => builder.gc_period(parse_toml_duration(key, item)?),
"wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?),
"wal_redo_timeout" => builder.wal_redo_timeout(parse_toml_duration(key, item)?),
"initial_superuser_name" => builder.superuser(parse_toml_string(key, item)?),
@@ -441,12 +442,16 @@ impl PageServerConf {
"auth_validation_public_key_path" => builder.auth_validation_public_key_path(Some(
PathBuf::from(parse_toml_string(key, item)?),
)),
"auth_type" => builder.auth_type(parse_toml_auth_type(key, item)?),
"auth_type" => builder.auth_type(parse_toml_from_str(key, item)?),
"remote_storage" => {
builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?))
}
"tenant_config" => {
t_conf = Self::parse_toml_tenant_conf(item)?;
}
"id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)),
_ => bail!("unrecognized pageserver option '{}'", key),
"profiling" => builder.profiling(parse_toml_from_str(key, item)?),
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -472,41 +477,75 @@ impl PageServerConf {
);
}
conf.default_tenant_conf = t_conf.merge(TenantConf::default());
Ok(conf)
}
// subroutine of parse_and_validate to parse `[tenant_conf]` section
pub fn parse_toml_tenant_conf(item: &toml_edit::Item) -> Result<TenantConfOpt> {
let mut t_conf: TenantConfOpt = Default::default();
if let Some(checkpoint_distance) = item.get("checkpoint_distance") {
t_conf.checkpoint_distance =
Some(parse_toml_u64("checkpoint_distance", checkpoint_distance)?);
}
if let Some(compaction_target_size) = item.get("compaction_target_size") {
t_conf.compaction_target_size = Some(parse_toml_u64(
"compaction_target_size",
compaction_target_size,
)?);
}
if let Some(compaction_period) = item.get("compaction_period") {
t_conf.compaction_period =
Some(parse_toml_duration("compaction_period", compaction_period)?);
}
if let Some(compaction_threshold) = item.get("compaction_threshold") {
t_conf.compaction_threshold =
Some(parse_toml_u64("compaction_threshold", compaction_threshold)?.try_into()?);
}
if let Some(gc_horizon) = item.get("gc_horizon") {
t_conf.gc_horizon = Some(parse_toml_u64("gc_horizon", gc_horizon)?);
}
if let Some(gc_period) = item.get("gc_period") {
t_conf.gc_period = Some(parse_toml_duration("gc_period", gc_period)?);
}
if let Some(pitr_interval) = item.get("pitr_interval") {
t_conf.pitr_interval = Some(parse_toml_duration("pitr_interval", pitr_interval)?);
}
Ok(t_conf)
}
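parse_toml_tenant_conf fills only the keys present in the [tenant_config] table; everything else stays None and is merged with TenantConf::default() later in parse_and_validate. A hedged example of exercising it directly; the values and the wrapper function are made up:

use toml_edit::Document;

fn tenant_conf_parsing_example() -> anyhow::Result<()> {
    let doc: Document = r#"
[tenant_config]
gc_horizon = 67108864
compaction_threshold = 10
pitr_interval = '30 days'
"#
    .parse()?;
    let t_conf = PageServerConf::parse_toml_tenant_conf(&doc["tenant_config"])?;
    assert_eq!(t_conf.gc_horizon, Some(64 * 1024 * 1024));
    assert_eq!(t_conf.compaction_threshold, Some(10));
    // Keys that were not present stay None and fall back to the defaults on merge.
    assert_eq!(t_conf.checkpoint_distance, None);
    Ok(())
}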
/// subroutine of parse_config(), to parse the `[remote_storage]` table.
fn parse_remote_storage_config(toml: &toml_edit::Item) -> anyhow::Result<RemoteStorageConfig> {
let local_path = toml.get("local_path");
let bucket_name = toml.get("bucket_name");
let bucket_region = toml.get("bucket_region");
let max_concurrent_sync: NonZeroUsize = if let Some(s) = toml.get("max_concurrent_sync") {
parse_toml_u64("max_concurrent_sync", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_concurrent_sync' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroUsize::new)
.context("'max_concurrent_sync' must be a non-zero positive integer")?
} else {
NonZeroUsize::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC).unwrap()
};
let max_sync_errors: NonZeroU32 = if let Some(s) = toml.get("max_sync_errors") {
parse_toml_u64("max_sync_errors", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_sync_errors' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroU32::new)
.context("'max_sync_errors' must be a non-zero positive integer")?
} else {
NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS).unwrap()
};
let max_concurrent_timelines_sync = NonZeroUsize::new(
parse_optional_integer("max_concurrent_timelines_sync", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC),
)
.context("Failed to parse 'max_concurrent_timelines_sync' as a positive integer")?;
let max_sync_errors = NonZeroU32::new(
parse_optional_integer("max_sync_errors", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
)
.context("Failed to parse 'max_sync_errors' as a positive integer")?;
let concurrency_limit = NonZeroUsize::new(
parse_optional_integer("concurrency_limit", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT),
)
.context("Failed to parse 'concurrency_limit' as a positive integer")?;
let storage = match (local_path, bucket_name, bucket_region) {
(None, None, None) => bail!("no 'local_path' nor 'bucket_name' option"),
@@ -537,6 +576,7 @@ impl PageServerConf {
.get("endpoint")
.map(|endpoint| parse_toml_string("endpoint", endpoint))
.transpose()?,
concurrency_limit,
}),
(Some(local_path), None, None) => RemoteStorageKind::LocalFs(PathBuf::from(
parse_toml_string("local_path", local_path)?,
@@ -545,7 +585,7 @@ impl PageServerConf {
};
Ok(RemoteStorageConfig {
max_concurrent_sync,
max_concurrent_timelines_sync,
max_sync_errors,
storage,
})
@@ -553,17 +593,13 @@ impl PageServerConf {
#[cfg(test)]
pub fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{}", test_name))
PathBuf::from(format!("../tmp_check/test_{test_name}"))
}
#[cfg(test)]
pub fn dummy_conf(repo_dir: PathBuf) -> Self {
PageServerConf {
id: ZNodeId(0),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: Duration::from_secs(10),
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
wait_lsn_timeout: Duration::from_secs(60),
wal_redo_timeout: Duration::from_secs(60),
page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE,
@@ -576,6 +612,8 @@ impl PageServerConf {
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::dummy_conf(),
}
}
}
@@ -585,7 +623,7 @@ impl PageServerConf {
fn parse_toml_string(name: &str, item: &Item) -> Result<String> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
.with_context(|| format!("configure option {name} is not a string"))?;
Ok(s.to_string())
}
@@ -594,26 +632,46 @@ fn parse_toml_u64(name: &str, item: &Item) -> Result<u64> {
// for our use, though.
let i: i64 = item
.as_integer()
.with_context(|| format!("configure option {} is not an integer", name))?;
.with_context(|| format!("configure option {name} is not an integer"))?;
if i < 0 {
bail!("configure option {} cannot be negative", name);
bail!("configure option {name} cannot be negative");
}
Ok(i as u64)
}
fn parse_optional_integer<I, E>(name: &str, item: &toml_edit::Item) -> anyhow::Result<Option<I>>
where
I: TryFrom<i64, Error = E>,
E: std::error::Error + Send + Sync + 'static,
{
let toml_integer = match item.get(name) {
Some(item) => item
.as_integer()
.with_context(|| format!("configure option {name} is not an integer"))?,
None => return Ok(None),
};
I::try_from(toml_integer)
.map(Some)
.with_context(|| format!("configure option {name} is too large"))
}
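parse_optional_integer factors out the "optional integer, range-checked into the target type" pattern used for max_concurrent_timelines_sync, max_sync_errors and concurrency_limit above. A hedged example for a single knob; the inline table and the wrapper function are illustrative:

use toml_edit::Document;

fn optional_integer_example() -> anyhow::Result<u32> {
    let doc: Document = "remote_storage = { max_sync_errors = 7 }".parse()?;
    // Absent keys come back as None; present keys are checked to fit the target type (u32 here).
    let value = parse_optional_integer("max_sync_errors", &doc["remote_storage"])?
        .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS);
    Ok(value)
}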
fn parse_toml_duration(name: &str, item: &Item) -> Result<Duration> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
.with_context(|| format!("configure option {name} is not a string"))?;
Ok(humantime::parse_duration(s)?)
}
fn parse_toml_auth_type(name: &str, item: &Item) -> Result<AuthType> {
fn parse_toml_from_str<T>(name: &str, item: &Item) -> Result<T>
where
T: FromStr<Err = anyhow::Error>,
{
let v = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
AuthType::from_str(v)
.with_context(|| format!("configure option {name} is not a string"))?;
T::from_str(v)
}
#[cfg(test)]
@@ -630,12 +688,6 @@ mod tests {
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
checkpoint_distance = 111 # in bytes
checkpoint_period = '111 s'
gc_period = '222 s'
gc_horizon = 222
wait_lsn_timeout = '111 s'
wal_redo_timeout = '111 s'
@@ -656,10 +708,8 @@ id = 10
let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display());
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
assert_eq!(
parsed_config,
@@ -667,10 +717,6 @@ id = 10
id: ZNodeId(10),
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: humantime::parse_duration(defaults::DEFAULT_CHECKPOINT_PERIOD)?,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?,
wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?,
wal_redo_timeout: humantime::parse_duration(defaults::DEFAULT_WAL_REDO_TIMEOUT)?,
superuser: defaults::DEFAULT_SUPERUSER.to_string(),
@@ -681,6 +727,8 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
},
"Correct defaults should be used when no config values are provided"
);
@@ -694,16 +742,13 @@ id = 10
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let config_string = format!(
"{}pg_distrib_dir='{}'",
ALL_BASE_VALUES_TOML,
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
assert_eq!(
parsed_config,
@@ -711,10 +756,6 @@ id = 10
id: ZNodeId(10),
listen_pg_addr: "127.0.0.1:64000".to_string(),
listen_http_addr: "127.0.0.1:9898".to_string(),
checkpoint_distance: 111,
checkpoint_period: Duration::from_secs(111),
gc_horizon: 222,
gc_period: Duration::from_secs(222),
wait_lsn_timeout: Duration::from_secs(111),
wal_redo_timeout: Duration::from_secs(111),
superuser: "zzzz".to_string(),
@@ -725,6 +766,8 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
},
"Should be able to parse all basic config values correctly"
);
@@ -753,37 +796,33 @@ local_path = '{}'"#,
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.remote_storage_config
.expect("Should have remote storage config for the local FS");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC
)
.unwrap(),
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_timelines_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC
)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
}
Ok(())
}
@@ -799,52 +838,49 @@ pg_distrib_dir='{}'
let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string();
let secret_access_key = "SOMEsEcReTsd292v".to_string();
let endpoint = "http://localhost:5000".to_string();
let max_concurrent_sync = NonZeroUsize::new(111).unwrap();
let max_concurrent_timelines_sync = NonZeroUsize::new(111).unwrap();
let max_sync_errors = NonZeroU32::new(222).unwrap();
let s3_concurrency_limit = NonZeroUsize::new(333).unwrap();
let identical_toml_declarations = &[
format!(
r#"[remote_storage]
max_concurrent_sync = {}
max_sync_errors = {}
bucket_name = '{}'
bucket_region = '{}'
prefix_in_bucket = '{}'
access_key_id = '{}'
secret_access_key = '{}'
endpoint = '{}'"#,
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
max_concurrent_timelines_sync = {max_concurrent_timelines_sync}
max_sync_errors = {max_sync_errors}
bucket_name = '{bucket_name}'
bucket_region = '{bucket_region}'
prefix_in_bucket = '{prefix_in_bucket}'
access_key_id = '{access_key_id}'
secret_access_key = '{secret_access_key}'
endpoint = '{endpoint}'
concurrency_limit = {s3_concurrency_limit}"#
),
format!(
"remote_storage={{max_concurrent_sync={}, max_sync_errors={}, bucket_name='{}', bucket_region='{}', prefix_in_bucket='{}', access_key_id='{}', secret_access_key='{}', endpoint='{}'}}",
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
"remote_storage={{max_concurrent_timelines_sync={max_concurrent_timelines_sync}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\
bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', access_key_id='{access_key_id}', secret_access_key='{secret_access_key}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}",
),
];
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.remote_storage_config
.expect("Should have remote storage config for S3");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync,
max_concurrent_timelines_sync,
max_sync_errors,
storage: RemoteStorageKind::AwsS3(S3Config {
bucket_name: bucket_name.clone(),
@@ -852,7 +888,8 @@ pg_distrib_dir='{}'
access_key_id: Some(access_key_id.clone()),
secret_access_key: Some(secret_access_key.clone()),
prefix_in_bucket: Some(prefix_in_bucket.clone()),
endpoint: Some(endpoint.clone())
endpoint: Some(endpoint.clone()),
concurrency_limit: s3_concurrency_limit,
}),
},
"Remote storage config should correctly parse the S3 config"

Some files were not shown because too many files have changed in this diff.