Compare commits

..

164 Commits

Author SHA1 Message Date
Arthur Petukhovsky
8c3b4fab14 Deploy backported patch to prod ps-3. 2022-07-13 15:23:43 +03:00
Bojan Serafimov
bacc52cf03 CI-only clippy errors 2022-07-12 17:46:27 -04:00
Bojan Serafimov
5bcc5e4891 Fix errors from new clippy version 2022-07-12 16:26:53 -04:00
Bojan Serafimov
05e3d16a18 More conflict resolution 2022-07-12 15:23:52 -04:00
Bojan Serafimov
4724e4b2d1 Resolve conflicts 2022-07-12 15:09:44 -04:00
Heikki Linnakangas
971c03873f Optimize importing a physical backup
Before this patch, importing a physical backup followed the same path
as ingesting any WAL records:

1. All the data pages from the backup are first collected in the
   DatadirModification object.
2. Then, they are "committed" to the Repository. They are written to
   the in-memory layer
3. Finally, the in-memory layer is frozen, and flushed to disk as a
   L0 delta layer file.

This was pretty inefficient. In step 1, the whole physical backup was
held in memory. If the backup is large, you simply run out of
memory. And in step 3, the resulting L0 delta layer file is large,
holding all the data again. That's a problem if the backup is larger
than 5 GB: Amazon S3 doesn't allow uploading files larger than 5 GB
(without using multi-part upload, see github issue #1910). So we want
to avoid that.

To alleviate those problems, optimize the codepath for importing a
physical backup. The basic flow is the same as before, but step 1
is optimized so that it doesn't accumulate all the data in memory,
and step 3 writes the data in image layers instead of one large delta
layer.
2022-07-12 14:54:34 -04:00
Heikki Linnakangas
ffd778a4a2 If an error happens during import of base backup or WAL, log it.
We only sent the error to the client, with no trace in the pageserver log.
Log it, similar to how we log errors in GetPage@LSN requests.
2022-07-12 14:53:09 -04:00
bojanserafimov
a25ccce3c8 Fix signal file parsing (#2042) 2022-07-12 14:52:54 -04:00
Bojan Serafimov
1389d6b6a5 Fix gc after import 2022-07-12 14:52:43 -04:00
bojanserafimov
8ca3faa61e Import basebackup into pageserver (#1925)
Allow importing basebackup taken from vanilla postgres or another pageserver via psql copy in protocol.
2022-07-12 14:52:30 -04:00
Alexey Kondratov
f88fe0218d Merge pull request #1842 from neondatabase/release-deploy-hotfix
[HOTFIX] Release deploy fix

This PR uses this branch neondatabase/postgres#171 and several required commits from the main to use only locally built compute-tools. This should allow us to rollout safekeepers sync issue fix on prod
2022-06-01 11:04:30 +03:00
Alexey Kondratov
cc856eca85 Install missing openssl packages in the Github Actions workflow 2022-05-31 21:31:31 +02:00
Alexey Kondratov
cf350c6002 Use :local compute-tools tag to build compute-node image 2022-05-31 21:31:16 +02:00
Arseny Sher
0ce6b6a0a3 Merge pull request #1836 from neondatabase/release-hotfix-basebackup-lsn-page-boundary
Bump vendor/postgres to hotfix basebackup LSN comparison.
2022-05-31 16:54:03 +04:00
Arseny Sher
73f247d537 Bump vendor/postgres to hotfix basebackup LSN comparison. 2022-05-31 16:00:50 +04:00
Andrey Taranik
960be82183 Merge pull request #1792 from neondatabase/main
Release 2202-05-25 (second)
2022-05-25 16:37:57 +03:00
Andrey Taranik
9ab52e2186 helm repository name fix for production proxy deploy (#1790) 2022-05-25 15:41:18 +03:00
Heikki Linnakangas
6f1f33ef42 Improve error messages on seccomp loading errors.
Bump vendor/postgres for https://github.com/neondatabase/postgres/pull/166
2022-05-25 14:33:06 +03:00
Andrey Taranik
703f691df8 production inventory update (#1779) 2022-05-25 14:30:50 +03:00
Arseny Sher
2b265fd6dc Disable restart_after_crash in neon_local.
It is pointless when basebackup is invalid.
2022-05-25 14:48:11 +04:00
Andrey Taranik
806e5a6c19 Merge pull request #1787 from neondatabase/main
Release 2022-05-25
2022-05-25 13:34:11 +03:00
Sergey Melnikov
d32b491a53 Add zenith-us-stage-sk-6 to deploy (#1728) 2022-05-25 10:31:10 +03:00
Kirill Bulatov
541ec25875 Properly shutdown test mock S3 server 2022-05-24 19:09:31 +03:00
KlimentSerafimov
8346aa3a29 Potential fix to #1626. Fixed typo is Makefile. (#1781)
* Potential fix to #1626. Fixed typo is Makefile.
* Completed fix to #1626.

Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
2022-05-24 04:55:38 -04:00
Heikki Linnakangas
2aceb6a309 Fix garbage collection to not remove image layers that are still needed.
The logic would incorrectly remove an image layer, if a new image layer
existed, even though the older image layer was still needed by some
delta layers after it. See example given in the comment this adds.

Without this fix, I was getting a lot of "could not find data for key
010000000000000000000000000000000000" errors from GC, with the new test
case being added in PR #1735.

Fixes #707
2022-05-23 20:58:27 +03:00
KlimentSerafimov
3ff5caf786 Add to readme install protobuf etcd (#1777)
* Update installation instructions
* Added libprotobuf-dev etcd to apt install

Added "brew install protobuf etcd" to OSX installation instructions.
Added "sudo apt install libprotobuf-dev etcd" to Linux installation instructions.
Without these, cargo build complains. 
Figured out in collaboration with Bojan.
2022-05-23 13:11:59 -04:00
chaitanya sharma
fbedd535c0 Replace a bunch of zenith references with neon. 2022-05-23 13:16:00 +03:00
Egor Suvorov
89e5659f3f Replace COPYRIGHT file from the root with NOTICE file
The primary reason: make GitHub detect that we use Apache License 2.0
They do it via https://github.com/licensee/licensee Ruby library (gem).

Our COPYRIGHT file contains a part of the Apache License, which should
be added to a source file, not the license or copyright information itself,
which confuses the library.

Instead, the recommended way is to create a NOTICE file which references
license of the code and its bundled dependencies.
2022-05-23 01:03:03 +02:00
Egor Suvorov
ef7cdb13e2 Remove unused dependencies from poetry.lock via poetry lock --no-update
There were a bunch of dependencies for Python <3.9. They are not needed
after #1254. This commit makes it easier to add/remove dependencies because
lock file will be updated like this on any such operation.

Do not update dependencies yet to not break anything.
2022-05-21 12:21:45 +02:00
Egor Suvorov
73187bfef1 postgres_ffi: find_end_of_wal_segment: clarify code around xl_crc retrieval
It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
2022-05-21 05:25:17 +02:00
Egor Suvorov
967eb38e81 postgres_ffi: find_end_of_wal_segment: fix contrecord skipping
Also enable corresponding test.
2022-05-21 05:25:17 +02:00
Egor Suvorov
a124e44866 postgres_ffi: find_end_of_wal_segment: add lots of trace 2022-05-21 05:25:17 +02:00
Egor Suvorov
c4b77084af utils: add const_assert! macro 2022-05-21 05:25:17 +02:00
Egor Suvorov
c9efdec8db postgres_ffi: find_end_of_wal_segment: improve name of wal_crc variable
Now it reflects the field it's mirroring.
2022-05-21 05:25:17 +02:00
Egor Suvorov
12b7c793b3 postgres_ffi: find_end_of_wal_segment: remove redundant CRC operations
Previous invariant: `crc` contains an "unfinalized" CRC32 value,
its one complement, like in postgres before FIN_CRC32C.

New invariant: `crc` always contains a "finalized" CRC32 value,
this is the semantics of crc32c_append, so we don't need to invert CRC manually.
2022-05-21 05:25:17 +02:00
Egor Suvorov
3c6890bf1d postgres_ffi: add complex WAL tests for find_end_of_wal
* Actual generation logic is in a separate crate `postgres_ffi/wal_generate`
* The create also provides a binary for debug purposes akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn.
  So testing would require disconnecting Safekeeper just after it has received the WAL,
  but before it is committed.
2022-05-21 05:25:17 +02:00
Andrey Taranik
d97617ed3a updated proxy and proxy scram deployment for prod and stress environments (#1758) 2022-05-20 23:12:30 +03:00
KlimentSerafimov
65cf1a3221 Added paths to openssl includes and libraries for OSX because make complained that it couldn't find them. (#1761) 2022-05-20 12:02:51 -04:00
bojanserafimov
a4aef5d8dc Compile psql with openssl (#1725) 2022-05-19 12:25:31 -04:00
Heikki Linnakangas
ffbb9dd155 Add a 5 minute timeout to python tests.
The CI times out after 10 minutes of no output. It's annoying if a
test hangs and is killed by the CI timeout, because you don't get
information about which test was running. Try to avoid that, by adding
a slightly smaller timeout in pytest itself. You can override it on a
per-test basis if needed, but let's try to keep our tests shorter than
that.

For the Postgres regression tests, use a longer 30 minute timeout.
They're not really a single test, but many tests wrapped in a single
pytest test. It's OK for them to run longer in aggregate, each
Postgres test is still fairly short.
2022-05-19 14:04:14 +03:00
Egor Suvorov
baf7a81dce git-upload: pass committer to 'git rebase' (fix #1749) (#1750)
No committer was specified, which resulted in failing `git rebase` if
the branch is not up-to-date.
2022-05-19 14:01:03 +03:00
Heikki Linnakangas
ee3bcf108d Fix compact_level0 for delta layers with overlap or gaps
We saw a case in staging, where there was a gap in the LSN ranges of
level 0 files, like this:

    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9

Note that gap between 000000000BFD0FE9 and 0000000060045901. I don't
know how that happened, but in general the pageserver should be robust
if there are gaps like that, or overlapping files etc. In theory they
could happen as result of crashes, partial downloads from S3 etc.,
although it is mystery what caused it in this case.

Looking at the compaction code, it was not safe in the face of gaps
like that. The compaction routine collected all the level 0 files, and
took their min(start)..max(end) as the range of the new files it
builds. That's wrong, if the level 0 files don't cover the whole LSN
range; the newly created files will miss any records in the gap. Fix
that, by only collecting contiguous sequences of level 0 files, so
that the end LSN of previous delta file is equal to the start of the
next one.

Fixes issue #1730
2022-05-19 10:19:38 +03:00
Heikki Linnakangas
0da4046704 Include traversal path in error message.
Previously, the path was printed to the log with separate error!() calls.
It's better to include the whole path in the error object and have it
printed to the log as one message.

Also print the path in the ValueReconstructResult::Missing case.

This is what it looks like now:

    2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8

    Caused by:
        0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1
        1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF

    Stack backtrace:
2022-05-19 10:19:38 +03:00
Anastasia Lubennikova
cbd00d7ed9 Remove temp layer files during timeline initialization on pageserver start 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
4c30ae8ba3 Add random string as a part of tempfile name 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
3da4b3165e Fsync layer files before rename 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
c1b365fdf7 Use temp filename while writing ImageLayer file 2022-05-19 10:11:12 +03:00
Egor Suvorov
fab104d5f3 docs/sourcetree: add note about exact Python version used and how to choose it 2022-05-19 00:09:13 +02:00
Egor Suvorov
7dd27ecd20 Bump minimal supported Python version to 3.9
Most of the CI already run with Python 3.9 since https://github.com/neondatabase/docker-images/pull/1
2022-05-19 00:09:13 +02:00
Egor Suvorov
bd2979d02c CirleCI/check-codestyle-python: print versions 2022-05-19 00:09:13 +02:00
Dmitry Rodionov
5914aab78a add comments, use expect instead of unwrap 2022-05-19 00:54:14 +03:00
Heikki Linnakangas
4a36d89247 Avoid spawning a layer-flush thread when there's no work to do.
The check_checkpoint_distance() always spawned a new thread, even if
there is no frozen layer to flush. That was a thinko, as @knizhnik
pointed out.
2022-05-19 00:51:48 +03:00
Egor Suvorov
432907ff5f Safekeeper: avoid holding mutex when deleting a tenant (#1746)
Following discussion with @arssher after #1653
2022-05-18 23:02:17 +03:00
Arthur Petukhovsky
98da0aa159 Add _total suffix to metrics name (#1741) 2022-05-18 15:17:04 +03:00
Alexey Kondratov
772c2fb4ff Report startup metrics and failure reason from compute_ctl (#1581)
+ neondatabase/cloud#1103

This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.

Also we now expose startup metrics (time of the each step: basebackup,
sync safekeepers, config, total). Console grabs them after each
successful start and report as histogram to prometheus and grafana.

OpenAPI spec is added and up-tp date, but is not currently used in the
console yet.
2022-05-18 13:03:29 +04:00
Andrey Taranik
b9f84f4a83 trun on storage deployment to neon-stress enviroment (#1729) 2022-05-17 23:04:04 +03:00
Arthur Petukhovsky
134eeeb096 Add more common storage metrics (#1722)
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
2022-05-17 19:29:01 +03:00
Heikki Linnakangas
55ea3f262e Fix race condition leading to panic in remote storage sync thread.
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.

To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is now
protected by standard blocking Mutex and Condvar, refactor the
functions touching it to be sync, not async.

A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.

Fixes #1719. Also add a test case.
2022-05-17 18:14:57 +03:00
Heikki Linnakangas
f03779bf1a Fix wait_for_last_record_lsn() and wait_for_upload() python functions.
The contract for wait_for() was not very clear. It waits until the
given function returns successfully, without an exception, but the
wait_for_last_record_lsn() and wait_for_upload() functions used "a <
b" as the condition, i.e. they thought that wait_for() would poll
until the function returns true.

Inline the logic from wait_for() into those two functions, it's not
that complicated, and you get a more specific error message too, if it
fails. Also add a comment to wait_for() to make it more clear how it
works.

Also change remote_consistent_lsn() to return 0 instead of raising an
exception, if remote is None. That can happen if nothing has been
uploaded to remote storage for the timeline yet. It happened once in
the CI, and I was able to reproduce that locally too by adding a sleep
to the storage sync thread, to delay the first upload.
2022-05-17 18:14:10 +03:00
Andrey Taranik
070c255522 Neon stress deploy (#1720)
* storage and proxy deployment for neon stress environment

* neon stress inventory fix
2022-05-17 18:03:01 +03:00
Heikki Linnakangas
9ccbb8d331 Make "neon_local stop" less verbose.
I got annoyed by all the noise in CI test output.

Before:

    $ ./target/release/neon_local stop
    Stop pageserver gracefully
    Pageserver still receives connections
    Pageserver stopped receiving connections
    Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
    initializing for sk 1 for 7676
    Stop safekeeper gracefully
    Safekeeper still receives connections
    Safekeeper stopped receiving connections
    Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)

After:

    $ ./target/release/neon_local stop
    Stopping pageserver gracefully...done!
    Stopping safekeeper 1 gracefully...done!

Also removes the spurious "initializing for sk 1 for 7676" message from
"neon_local start"
2022-05-17 10:31:13 +03:00
Kirill Bulatov
f2881bbd8a Start and stop single etcd and mock s3 servers globally in python tests 2022-05-17 01:17:44 +03:00
Kirill Bulatov
a884f4cf6b Add etcd to neon_local 2022-05-17 01:17:44 +03:00
Kirill Bulatov
9a0fed0880 Enable at least 1 safekeeper in every test 2022-05-17 01:17:44 +03:00
chaitanya sharma
bea84150b2 Fix the markdown rendering on 004-durability.md RFC 2022-05-17 00:16:42 +03:00
chaitanya sharma
85b5c0e989 List profiling as a feature with 'pageserver --enabled-features'
Fixes https://github.com/neondatabase/neon/issues/1627
2022-05-16 21:10:57 +03:00
Thang Pham
e4a70faa08 Add more information to timeline-related APIs (#1673)
Resolves #1488.

- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn` 
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
2022-05-16 11:05:43 -04:00
chaitanya sharma
c41549f630 Update readme build for osx (#1709) 2022-05-16 10:42:08 -04:00
Heikki Linnakangas
c700032dd2 Run the regression tests in CI also for PRs opened from forked repos. 2022-05-16 14:40:49 +03:00
Kirill Bulatov
33cac863d7 Test simple.conf and handle broker_endpoints better 2022-05-16 12:07:35 +03:00
Heikki Linnakangas
51ea9c3053 Don't swallow panics when the pageserver is build with failpoints.
It's very confusing, and because you don't get a stack trace and error
message in the logs, makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
2022-05-16 09:58:58 +03:00
Heikki Linnakangas
a10cac980f Continue with pageserver startup, if loading some tenants fail.
Fixes https://github.com/neondatabase/neon/issues/1664
2022-05-15 00:25:38 +03:00
Heikki Linnakangas
081d5dac5e Bump vendor/postgres.
Includes change to reduce log noise from inmem_smgr.
2022-05-13 21:41:00 +03:00
Andrey Taranik
cded72a580 remove sk-2 from staging inventory list (#1699) 2022-05-13 20:41:54 +03:00
Egor Suvorov
768c846eeb Fix test_delete_force from #1653 conflicting with #1692 2022-05-13 17:36:18 +02:00
Anastasia Lubennikova
a2561f0a78 Use tenant's pitr_interval instead of hardroded 0 in the command.
Adjust python tests that use the
2022-05-13 18:32:14 +03:00
Anastasia Lubennikova
aa7c601eca Fix pitr_interval check in GC:
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
2022-05-13 18:32:14 +03:00
Egor Suvorov
bf899a57d9 Safekeeper: add timeline/tenant force delete HTTP endpoings (closes #895)
* There is no auth in Safekeeper HTTP at all currently,
  so simply calling `check_permission` is not enough.
* There are no checks of Safekeeper still working with the data,
  as "still working" is burry now: a timeline may be "active"
  while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and timeline is removed from the
  internal map. It can easily sneak back in case of race conditions
  and implicit creations, though.
2022-05-13 15:43:52 +02:00
Egor Suvorov
07b85e7cfc Safekeeper refactor: move callmemaybe_tx from SafekeeperPostgresBackend to Timeline 2022-05-13 15:43:52 +02:00
Egor Suvorov
22d997049c libs/utils/http/request: add ensure_no_body 2022-05-13 15:43:52 +02:00
Kirill Bulatov
b683308791 Return GIT_VERSION back to storage binaries 2022-05-13 16:34:32 +03:00
Kirill Bulatov
51c0f9ab2b Force git version to be up to date via decl macro 2022-05-13 16:34:32 +03:00
Stas Kelvich
0030da57a8 compute-tools: grant rw priveleges to the all created users 2022-05-13 11:27:00 +03:00
Kirill Bulatov
85884a1599 Disable tenant relocation python test 2022-05-13 01:26:38 +03:00
Thang Pham
ae20751724 update ZenithCli::create_tenant return signature (#1692)
to include the initial timeline's ID in addition to the new tenant's ID.

Context: follow-up of https://github.com/neondatabase/neon/pull/1689
2022-05-12 17:27:08 -04:00
Thang Pham
5812e26b90 Create an initial timeline on CLI tenant creation (#1689)
Resolves #1655
2022-05-12 16:33:09 -04:00
Arthur Petukhovsky
ec8861b8cc Fix pageserver metrics names (#1682)
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pagserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
2022-05-12 19:53:07 +03:00
Kirill Bulatov
4538f1e1b8 Correctly operate etcd safekeeper timeline data 2022-05-12 18:47:31 +03:00
Stas Kelvich
b10ae195b7 Set vendor/postgres back to the main branch
I accidentally merged postgres PR that was referencing non-main branch.
2022-05-12 15:05:49 +03:00
Alexey Kondratov
b426775aa0 Use compute-tools from the new neondatabase Docker Hub repo 2022-05-12 12:26:24 +03:00
Heikki Linnakangas
5da4f3a4df Refactor DeltaLayer::dump() function
Put most of the code in a closure that returns Result, so that we can
use the ?-operator for error handling. That's simpler.
2022-05-12 10:31:04 +03:00
Konstantin Knizhnik
2bde77fced Do not apply records with LSN smaller than LSN of cached image in del… (#1672)
* Do not apply records with LSN smaller than LSN of cached image in delta layer

* Do not apply records with LSN smaller than LSN of cached image in delta layer
2022-05-12 07:56:02 +03:00
Dhammika Pathirana
c864091035 Fix err msg typo
Signed-off-by: Dhammika Pathirana <dham@neon.tech>
2022-05-11 16:13:26 -07:00
Anton Shyrabokau
20361395bb Add zenith-us-stage-sk-5 to circleci inventory (#1665)
Co-authored-by: Debian <admin@ip-10-0-5-32.us-west-2.compute.internal>
2022-05-11 21:36:53 +03:00
Arseny Sher
b338b5dffe Make callmemaybe less agressive until we fix it/migrate to bigger machines. 2022-05-11 22:16:13 +04:00
Stas Kelvich
5bd879f641 Proxy: update protocol after cluster->project rename 2022-05-11 15:50:36 +03:00
Konstantin Knizhnik
e6e883eb12 Do not set LSN for new FPI page (#1657)
* Do not set LSN for new FPI page

refer #1656

* Add page_is_new, page_get_lsn, page_set_lsn functions

* Fix page_is_new implementation

* Add comment from XLogReadBufferForRedoExtended
2022-05-11 15:23:17 +03:00
Heikki Linnakangas
d710dff975 Remove unnecessary Serialize/Deserialize traits from VecMap.
It's never stored on disk. Let's be tidy.
2022-05-10 23:47:40 +03:00
Arseny Sher
6cb14b4200 Optionally remove WAL on safekeepers without s3 offloading.
And do that on staging, until offloading is merged.
2022-05-10 22:41:02 +04:00
Thang Pham
87dfa99734 Update layered_repository REAMDE (#1659) 2022-05-10 09:55:14 -04:00
Thang Pham
cf59b51519 Update README (Running local installation section) (#1649) 2022-05-09 11:11:46 -04:00
Kirill Bulatov
0a7735a656 Rework remote storage sync queue, general refactoring 2022-05-07 01:33:33 +03:00
Kirill Bulatov
64a602b8f3 Delete timeline layers 2022-05-07 01:33:33 +03:00
Kirill Bulatov
10e4da3997 Rework timeline batching 2022-05-07 01:33:33 +03:00
Kirill Bulatov
de37f982db Share the remote storage as a crate 2022-05-07 00:30:36 +03:00
Kirill Bulatov
d4e155aaa3 Librarify common etcd timeline logic 2022-05-06 22:32:57 +03:00
Arseny Sher
dd6dca9072 Bump vendor/postgres to shut down on wrong basebackup. 2022-05-06 20:07:26 +04:00
bojanserafimov
ef40e404cf Rename zenith crate to neon_local (#1625) 2022-05-05 19:06:53 -04:00
Sergey Melnikov
11a44eda0e Add TLS support in scram-proxy (#1643)
* Add TLS support in scram-proxy

* Fix authEndpoint
2022-05-05 23:48:16 +03:00
Heikki Linnakangas
30a7598172 Some copy-editing. 2022-05-05 22:35:15 +03:00
Heikki Linnakangas
1ad5658d9c Fix typos 2022-05-05 22:35:15 +03:00
Dmitry Rodionov
954859f6c5 add readme for performance tests with the current state of things 2022-05-05 22:35:15 +03:00
Andrey Taranik
4024bfe736 get_binaries script fix (#1638)
* get_binaries script fix

* minor improvment for get_binaries
2022-05-05 22:21:07 +03:00
Kirill Bulatov
2ef0e5c6ed Do not require metadata in every upload sync task 2022-05-05 18:26:39 +03:00
Kirill Bulatov
52a7e3155e Add local path to the Layer trait and historic layers 2022-05-05 18:26:39 +03:00
Thang Pham
ad5eaa6027 Use node's LSN for read-only nodes (#1642)
Fixes #1410.
2022-05-05 10:53:10 -04:00
Dmitry Rodionov
0f3ec83172 avoid detach with alive branches 2022-05-05 12:54:42 +03:00
Arseny Sher
c46fe90010 Fix division by zero in WAL removal. 2022-05-05 10:41:43 +04:00
bojanserafimov
bc569dde51 Remove some unwraps from waldecoder (#1539) 2022-05-04 17:41:05 -04:00
bojanserafimov
02e5083695 Add hot page test (#1479) 2022-05-04 12:45:01 -04:00
Thang Pham
c4bc604e5f Fix pg list table alignment #1633
Fixes #1628

- add [`comfy_table`](https://github.com/Nukesor/comfy-table/tree/main) and use it to construct table for `pg list` CLI command

Comparison

- Old:

```
NODE	ADDRESS	TIMELINE	BRANCH NAME	LSN		STATUS
main	127.0.0.1:55432	3823dd05e35d71f6ccf33049de366d70	main	0/16FB140	running
migration_check	127.0.0.1:55433	3823dd05e35d71f6ccf33049de366d70	main	0/16FB140	running
```

- New:

```
 NODE             ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main             127.0.0.1:55432  3823dd05e35d71f6ccf33049de366d70  main         0/16FB140  running
 migration_check  127.0.0.1:55433  3823dd05e35d71f6ccf33049de366d70  main         0/16FB140  running
```
2022-05-04 12:12:26 -04:00
Anastasia Lubennikova
b8880bfaab Bump vendor/postgres 2022-05-04 18:14:45 +03:00
Anastasia Lubennikova
e2cf77441d Implement pg_database_size().
In this implementation dbsize equals sum of all relation sizes, excluding shared ones.
2022-05-04 18:14:45 +03:00
Arseny Sher
b68e3b03ed Fix control file update for b9fd8a36ad 2022-05-04 17:11:22 +04:00
Arseny Sher
e58c83870f Bump vendor/postgres to to send timeline_start_lsn. 2022-05-04 14:32:03 +04:00
Arseny Sher
b9fd8a36ad Remember timeline_start_lsn and local_start_lsn on safekeeper.
Make it remember when timeline starts in general and on this safekeeper in
particular (the point might be later on new safekeeper replacing failed one).

Bumps control file and walproposer protocol versions.

While protocol is bumped, also add safekeeper node id to
AcceptorProposerGreeting.

ref #1561
2022-05-04 14:32:03 +04:00
Heikki Linnakangas
748c5a577b Bump vendor/postgres. (#1616)
Includes fix for https://github.com/neondatabase/neon/issues/1615
2022-05-04 10:54:44 +03:00
Stas Kelvich
51a0f2683b fix scram-proxy addresses 2022-05-04 01:35:30 +03:00
Dmitry Rodionov
9dfa145c7c tone down tenant not found error 2022-05-04 00:47:52 +03:00
Stas Kelvich
5642d0b2b8 Change shutdown_process_on_error thread spawn settings.
Now princeple is following: acceptor threads (libpq and http) error will
bring the pageserver down, but all per-tenant thread failures will be treated
as an error.
2022-05-04 00:42:57 +03:00
Dmitry Rodionov
2f83f793bc print more details when thread fails 2022-05-03 18:31:23 +03:00
Anastasia Lubennikova
2f9b17b9e5 Add simple test of pageserver recovery after crash. To cause a crash, use failpoints in checkpointer 2022-05-03 17:13:09 +03:00
Dmitry Rodionov
e7cba0b607 use thiserror instead of anyhow in disk_btree 2022-05-03 15:34:23 +03:00
Dmitry Rodionov
ff7e9a86c6 turn panic into an error with more details 2022-05-03 12:44:42 +03:00
Heikki Linnakangas
9ede38b6c4 Support finding LSN from a commit timestamp.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.

An extra timestamp field is now stored in an extra field after each
Clog page. It is the timestamp of the latest commit, among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.

To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.

Implements github issue 1361.
2022-05-03 09:28:57 +03:00
Heikki Linnakangas
62449d6068 Bump vendor/postgres (#1573)
This brings us the performance improvements to WAL redo from
https://github.com/neondatabase/postgres/pull/144
2022-05-03 09:25:12 +03:00
Konstantin Knizhnik
baa59512b8 Traverse frozen layer in get_reconstruct_data in reverse order (#1601)
* Traverse frozen layer in get_reconstruct_data in reverse order

* Fix comments on frozen layers.

Note explicitly the order that the layers are in the queue.

* Add fail point to reproduce failpoint iteration error

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2022-05-03 08:07:14 +03:00
Heikki Linnakangas
87a6c4d051 RFC on connection routing and authentication.
This documents how we want this to work. We're not quite there yet.
2022-05-02 23:39:06 +03:00
Stas Kelvich
801b749e1d Set correct authEndpoint for the new proxy 2022-05-02 21:46:32 +03:00
Kirill Bulatov
5cb501c2b3 Make remote storage test less flacky 2022-05-02 20:04:48 +03:00
Dmitry Rodionov
ad25736f3a Exit pageserver process with correct error code
When we shutdown pageserver due to an error (e g one of th important
thrads panicked) use 1 exit code so systemd can properly restart it
2022-05-02 19:04:45 +03:00
Stas Kelvich
9a396e1feb Support SNI-based routing in proxy 2022-05-02 18:32:18 +03:00
Stas Kelvich
0323bb5870 [proxy] Refactor cplane API and add new console SCRAM auth API
Now proxy binary accepts `--auth-backend` CLI option, which determines
auth scheme and cluster routing method. Following backends are currently
implemented:

* legacy
    old method, when username ends with `@zenith` it uses md5 auth dbname as
    the cluster name; otherwise, it sends a login link and waits for the console
    to call back
* console
    new SCRAM-based console API; uses SNI info to select the destination
    cluster
* postgres
    uses postgres to select auth secrets of existing roles. Useful for local
    testing
* link
    sends login link for all usernames
2022-05-02 18:32:18 +03:00
Dmitry Ivanov
af0195b604 [proxy] Introduce cloud::Api for communication with Neon Cloud
* `cloud::legacy` talks to Cloud API V1.
* `cloud::api` defines Cloud API v2.
* `cloud::local` mocks the Cloud API V2 using a local postgres instance.
* It's possible to choose between API versions using the `--api-version` flag.
2022-05-02 18:32:18 +03:00
Dmitry Ivanov
9df8915b03 [proxy] sasl::Mechanism may return Output during exchange
This is needed to forward the `ClientKey` that's required
to connect the proxy to a compute.

Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
2022-05-02 18:32:18 +03:00
Dmitry Ivanov
4b1bd32e4a Drop Debug impl for ScramKey and ServerSecret
There's a notion that accidental misuse of those implementations
might reveal authentication secrets.
2022-05-02 18:32:18 +03:00
Andrey Taranik
68ba6a58a0 authEndpoint fix 2022-05-02 17:55:13 +03:00
Andrey Taranik
8f479a712f minor fixes in proxy deployment 2022-05-02 17:55:13 +03:00
Stas Kelvich
2477d2f9e2 Deploy standalone SRAM proxy on staging 2022-05-02 17:55:13 +03:00
Dhammika Pathirana
992874c916 Fix update ps settings doc
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:52:08 -07:00
Dhammika Pathirana
3128e8c75c Fix tenant conf test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:25 -07:00
Dhammika Pathirana
f3f12db2cb Add gc churn threshold knob (#1594)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:17 -07:00
Andrey Taranik
038ea4c128 proxy notice message update (#1600) 2022-04-30 22:04:08 +03:00
Kirill Bulatov
7e1db8c8a1 Show which virtual file got the deserialization errors 2022-04-29 21:40:57 +03:00
Andrey Taranik
aa933d3961 proxy settings update for new domain (#1597) 2022-04-29 20:05:14 +03:00
Dmitry Rodionov
67b4e38092 remporarily disable test_backpressure_received_lsn_lag 2022-04-29 15:53:56 +03:00
Dmitry Rodionov
05f8e6a050 Use fsync+rename for atomic downloads from remote storage
Use failpoint in test_remote_storage to check the behavior
2022-04-29 15:53:56 +03:00
chaitanya sharma
76388abeb6 Rename READMEs with .md extension, and fix links to them.
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.

Rename libs/postgres_ffi/README and safekeeper/README files to also
have the the .md extension, so that github can render them nicely.

Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
2022-04-29 14:23:42 +03:00
Kirill Bulatov
2911eb084a Remove timeline files on detach 2022-04-29 09:19:18 +03:00
Kirill Bulatov
6cca57f95a Properly remove from the local timeline map 2022-04-29 09:19:18 +03:00
Kirill Bulatov
4a46b01caf Properly populate local timeline map 2022-04-29 09:19:18 +03:00
Anastasia Lubennikova
5c5c3c64f3 Fix tenant config parsing. Add a test 2022-04-28 11:49:19 +03:00
Alexey Kondratov
8d5df07cce Merge pull request #1385 from zenithdb/main
Release main 2022-03-22
2022-03-22 05:04:34 -05:00
Andrey Taranik
df7a9d1407 release fix 2022-03-16 (#1375) 2022-03-17 00:43:28 +03:00
211 changed files with 11716 additions and 6116 deletions

View File

@@ -1,2 +1,4 @@
zenith_install.tar.gz
.zenith_current_version
neon_install.tar.gz
.neon_current_version

View File

@@ -3,27 +3,8 @@
set -e
RELEASE=${RELEASE:-false}
# look at docker hub for latest tag for neon docker image
if [ "${RELEASE}" = "true" ]; then
echo "search latest relase tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
else
TAG="release-${VERSION}"
fi
else
echo "search latest dev tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -v release | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
else
TAG="${VERSION}"
fi
fi
VERSION="1"
TAG="backport-f88fe021-import-patch-${VERSION}"
echo "found ${VERSION}"

View File

@@ -0,0 +1,19 @@
[pageservers]
neon-stress-ps-1 console_region_id=1
neon-stress-ps-2 console_region_id=1
[safekeepers]
neon-stress-sk-1 console_region_id=1
neon-stress-sk-2 console_region_id=1
neon-stress-sk-3 console_region_id=1
[storage:children]
pageservers
safekeepers
[storage:vars]
console_mgmt_base_url = http://neon-stress-console.local
bucket_name = neon-storage-ireland
bucket_region = eu-west-1
etcd_endpoints = etcd-stress.local:2379
safekeeper_enable_s3_offload = false

View File

@@ -1,10 +1,7 @@
[pageservers]
zenith-1-ps-1 console_region_id=1
zenith-1-ps-3 console_region_id=1
[safekeepers]
zenith-1-sk-1 console_region_id=1
zenith-1-sk-2 console_region_id=1
zenith-1-sk-3 console_region_id=1
[storage:children]
pageservers
@@ -15,3 +12,4 @@ console_mgmt_base_url = http://console-release.local
bucket_name = zenith-storage-oregon
bucket_region = us-west-2
etcd_endpoints = etcd-release.local:2379
safekeeper_enable_s3_offload = false

View File

@@ -4,8 +4,9 @@ zenith-us-stage-ps-2 console_region_id=27
[safekeepers]
zenith-us-stage-sk-1 console_region_id=27
zenith-us-stage-sk-2 console_region_id=27
zenith-us-stage-sk-4 console_region_id=27
zenith-us-stage-sk-5 console_region_id=27
zenith-us-stage-sk-6 console_region_id=27
[storage:children]
pageservers
@@ -16,3 +17,4 @@ console_mgmt_base_url = http://console-staging.local
bucket_name = zenith-staging-storage-us-east-1
bucket_region = us-east-1
etcd_endpoints = etcd-staging.local:2379
safekeeper_enable_s3_offload = false

View File

@@ -6,7 +6,7 @@ After=network.target auditd.service
Type=simple
User=pageserver
Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/pageserver LD_LIBRARY_PATH=/usr/local/lib
ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /storage/pageserver/data
ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -c "broker_endpoints=['{{ etcd_endpoints }}']" -D /storage/pageserver/data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT

View File

@@ -6,7 +6,7 @@ After=network.target auditd.service
Type=simple
User=safekeeper
Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/safekeeper/data LD_LIBRARY_PATH=/usr/local/lib
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }}
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }} --enable-s3-offload={{ safekeeper_enable_s3_offload }}
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT

View File

@@ -121,7 +121,7 @@ jobs:
export RUSTC_WRAPPER=cachepot
export AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}"
"${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests
"${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --features failpoints --bins --tests
cachepot -s
- save_cache:
@@ -222,6 +222,12 @@ jobs:
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
name: Print versions
when: always
command: |
poetry run python --version
poetry show
- run:
name: Run yapf to ensure code format
when: always
@@ -355,7 +361,7 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete
find /tmp/test_output -type f ! -name "*.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" ! -name "*.metrics" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
@@ -456,9 +462,6 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build neondatabase/compute-tools:latest image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
@@ -466,7 +469,10 @@ jobs:
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag neondatabase/compute-tools:latest -f Dockerfile.compute-tools .
--tag neondatabase/compute-tools:local \
--tag neondatabase/compute-tools:latest \
-f Dockerfile.compute-tools .
# Only push :latest image
docker push neondatabase/compute-tools:latest
- run:
name: Init postgres submodule
@@ -476,7 +482,9 @@ jobs:
command: |
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:latest vendor/postgres
docker build --tag neondatabase/compute-node:${DOCKER_TAG} \
--tag neondatabase/compute-node:latest vendor/postgres \
--build-arg COMPUTE_TOOLS_TAG=local
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:latest
@@ -495,15 +503,14 @@ jobs:
name: Build and push Docker image
command: |
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
DOCKER_TAG="backport-f88fe021-import-patch-1"
docker build \
--pull \
--build-arg GIT_VERSION=${CIRCLE_SHA1} \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:release .
--tag neondatabase/neon:${DOCKER_TAG} .
docker push neondatabase/neon:${DOCKER_TAG}
docker push neondatabase/neon:release
# Build production neondatabase/compute-node:release image and push it to Docker hub
docker-image-compute-release:
@@ -513,18 +520,19 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build neondatabase/compute-tools:release image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag neondatabase/compute-tools:release -f Dockerfile.compute-tools .
docker push neondatabase/compute-tools:release
# don't build new compute image
# docker build \
# --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
# --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
# --tag neondatabase/compute-tools:release \
# --tag neondatabase/compute-tools:local \
# -f Dockerfile.compute-tools .
# # Only push :release image
# docker push neondatabase/compute-tools:release
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
@@ -533,9 +541,12 @@ jobs:
command: |
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:release vendor/postgres
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:release
# don't build new compute image
# docker build --tag neondatabase/compute-node:${DOCKER_TAG} \
# --tag neondatabase/compute-node:release vendor/postgres \
# --build-arg COMPUTE_TOOLS_TAG=local
# docker push neondatabase/compute-node:${DOCKER_TAG}
# docker push neondatabase/compute-node:release
deploy-staging:
docker:
@@ -579,13 +590,63 @@ jobs:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://neondatabase.github.io/helm-charts
helm repo add neondatabase https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
DOCKER_TAG=$(git log --oneline|wc -l)
helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
helm upgrade neon-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait
deploy-neon-stress:
docker:
- image: cimg/python:3.10
steps:
- checkout
- setup_remote_docker
- run:
name: Setup ansible
command: |
pip install --progress-bar off --user ansible boto3
- run:
name: Redeploy
command: |
cd "$(pwd)/.circleci/ansible"
./get_binaries.sh
echo "${TELEPORT_SSH_KEY}" | tr -d '\n'| base64 --decode >ssh-key
echo "${TELEPORT_SSH_CERT}" | tr -d '\n'| base64 --decode >ssh-key-cert.pub
chmod 0600 ssh-key
ssh-add ssh-key
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i neon-stress.hosts
rm -f neon_install.tar.gz .neon_current_version
deploy-neon-stress-proxy:
docker:
- image: cimg/base:2021.04
environment:
KUBECONFIG: .kubeconfig
steps:
- checkout
- run:
name: Store kubeconfig file
command: |
echo "${NEON_STRESS_KUBECONFIG_DATA}" | base64 --decode > ${KUBECONFIG}
chmod 0600 ${KUBECONFIG}
- run:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add neondatabase https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
DOCKER_TAG=$(git log --oneline|wc -l)
helm upgrade neon-stress-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
helm upgrade neon-stress-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait
deploy-release:
docker:
@@ -620,21 +681,7 @@ jobs:
KUBECONFIG: .kubeconfig
steps:
- checkout
- run:
name: Store kubeconfig file
command: |
echo "${PRODUCTION_KUBECONFIG_DATA}" | base64 --decode > ${KUBECONFIG}
chmod 0600 ${KUBECONFIG}
- run:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add zenithdb https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
DOCKER_TAG="release-$(git log --oneline|wc -l)"
helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/helm-values/production.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
# don't deploy proxy
# Trigger a new remote CI job
remote-ci-trigger:
@@ -771,14 +818,33 @@ workflows:
requires:
- docker-image
- docker-image-release:
- deploy-neon-stress:
# Context gives an ability to login
context: Docker Hub
# Build image only for commits to main
# deploy only for commits to main
filters:
branches:
only:
- release
- main
requires:
- docker-image
- deploy-neon-stress-proxy:
# deploy only for commits to main
filters:
branches:
only:
- main
requires:
- docker-image
- docker-image-release:
# Context gives an ability to login
context: Docker Hub
# Build image only for commits to f88fe021-import-patch
filters:
branches:
only:
- f88fe021-import-patch
requires:
- pg_regress-tests-release
- other-tests-release
@@ -796,11 +862,11 @@ workflows:
- deploy-release:
# Context gives an ability to login
context: Docker Hub
# deploy only for commits to main
# deploy only for commits to f88fe021-import-patch
filters:
branches:
only:
- release
- f88fe021-import-patch
requires:
- docker-image-release
- deploy-release-proxy:

View File

@@ -0,0 +1,26 @@
fullnameOverride: "neon-stress-proxy-scram"
settings:
authBackend: "console"
authEndpoint: "http://neon-stress-console.local/management/api/v2"
domain: "*.stress.neon.tech"
podLabels:
zenith_service: proxy-scram
zenith_env: staging
zenith_region: eu-west-1
zenith_region_slug: ireland
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: '*.stress.neon.tech'
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -0,0 +1,34 @@
fullnameOverride: "neon-stress-proxy"
settings:
authEndpoint: "https://console.dev.neon.tech/authenticate_proxy_request/"
uri: "https://console.dev.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
zenith_service: proxy
zenith_env: staging
zenith_region: eu-west-1
zenith_region_slug: ireland
service:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
external-dns.alpha.kubernetes.io/hostname: neon-stress-proxy.local
type: LoadBalancer
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: connect.dev.neon.tech
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -0,0 +1,24 @@
settings:
authBackend: "console"
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.cloud.neon.tech"
podLabels:
zenith_service: proxy-scram
zenith_env: production
zenith_region: us-west-2
zenith_region_slug: oregon
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: '*.cloud.neon.tech'
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -1,12 +1,6 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.zenith.tech/authenticate_proxy_request/"
uri: "https://console.zenith.tech/psql_session/"
authEndpoint: "https://console.neon.tech/authenticate_proxy_request/"
uri: "https://console.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -28,7 +22,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.zenith.tech
external-dns.alpha.kubernetes.io/hostname: connect.neon.tech,pg.neon.tech
metrics:
enabled: true

View File

@@ -0,0 +1,31 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.cloud.stage.neon.tech"
# -- Additional labels for zenith-proxy pods
podLabels:
zenith_service: proxy-scram
zenith_env: staging
zenith_region: us-east-1
zenith_region_slug: virginia
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: cloud.stage.neon.tech
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -5,8 +5,8 @@ image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/"
uri: "https://console.stage.zenith.tech/psql_session/"
authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/"
uri: "https://console.stage.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -20,7 +20,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.stage.zenith.tech
external-dns.alpha.kubernetes.io/hostname: connect.stage.neon.tech
metrics:
enabled: true

View File

@@ -1,6 +1,8 @@
name: Build and Test
on: push
on:
pull_request:
push:
jobs:
regression-check:
@@ -32,11 +34,11 @@ jobs:
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt update
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev libssl-dev
- name: Install macOs postgres dependencies
- name: Install macOS postgres dependencies
if: matrix.os == 'macos-latest'
run: brew install flex bison
run: brew install flex bison openssl
- name: Set pg revision for caching
id: pg_ver
@@ -50,10 +52,27 @@ jobs:
tmp_install/
key: ${{ runner.os }}-pg-${{ steps.pg_ver.outputs.pg_rev }}
- name: Set extra env for macOS
if: matrix.os == 'macos-latest'
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Build postgres
if: steps.cache_pg.outputs.cache-hit != 'true'
run: make postgres
# Plain configure output can contain weird errors like 'error: C compiler cannot create executables'
# and the real cause will be inside config.log
- name: Print configure logs in case of failure
if: failure()
continue-on-error: true
run: |
echo '' && echo '=== config.log ===' && echo ''
cat tmp_install/build/config.log
echo '' && echo '=== configure.log ===' && echo ''
cat tmp_install/build/configure.log
- name: Cache cargo deps
id: cache_cargo
uses: actions/cache@v2

3
.gitignore vendored
View File

@@ -11,3 +11,6 @@ test_output/
# Coverage
*.profraw
*.profdata
*.key
*.crt

View File

@@ -1,20 +0,0 @@
This software is licensed under the Apache 2.0 License:
----------------------------------------------------------------------------
Copyright 2021 Zenith Labs, Inc
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
----------------------------------------------------------------------------
The PostgreSQL submodule in vendor/postgres is licensed under the
PostgreSQL license. See vendor/postgres/COPYRIGHT.

433
Cargo.lock generated
View File

@@ -113,6 +113,49 @@ version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa"
[[package]]
name = "axum"
version = "0.5.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f4af7447fc1214c1f3a1ace861d0216a6c8bb13965b64bbad9650f375b67689a"
dependencies = [
"async-trait",
"axum-core",
"bitflags",
"bytes",
"futures-util",
"http",
"http-body",
"hyper",
"itoa 1.0.1",
"matchit",
"memchr",
"mime",
"percent-encoding",
"pin-project-lite",
"serde",
"sync_wrapper",
"tokio",
"tower",
"tower-http",
"tower-layer",
"tower-service",
]
[[package]]
name = "axum-core"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3bdc19781b16e32f8a7200368a336fa4509d4b72ef15dd4e41df5290855ee1e6"
dependencies = [
"async-trait",
"bytes",
"futures-util",
"http",
"http-body",
"mime",
]
[[package]]
name = "backtrace"
version = "0.3.64"
@@ -123,7 +166,7 @@ dependencies = [
"cc",
"cfg-if",
"libc",
"miniz_oxide",
"miniz_oxide 0.4.4",
"object",
"rustc-demangle",
]
@@ -307,41 +350,36 @@ dependencies = [
[[package]]
name = "clap"
version = "3.1.12"
version = "3.0.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7c167e37342afc5f33fd87bbc870cedd020d2a6dffa05d45ccd9241fbdd146db"
checksum = "b63edc3f163b3c71ec8aa23f9bd6070f77edbf3d1d198b164afa90ff00e4ec62"
dependencies = [
"atty",
"bitflags",
"clap_derive",
"clap_lex",
"indexmap",
"lazy_static",
"os_str_bytes",
"strsim 0.10.0",
"termcolor",
"textwrap 0.15.0",
"textwrap 0.14.2",
]
[[package]]
name = "clap_derive"
version = "3.1.7"
name = "close_fds"
version = "0.3.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a3aab4734e083b809aaf5794e14e756d1c798d2c69c7f7de7a09a2f5214993c1"
checksum = "3bc416f33de9d59e79e57560f450d21ff8393adcf1cdfc3e6d8fb93d5f88a2ed"
dependencies = [
"heck 0.4.0",
"proc-macro-error",
"proc-macro2",
"quote",
"syn",
"cfg-if",
"libc",
]
[[package]]
name = "clap_lex"
version = "0.1.1"
name = "cmake"
version = "0.1.48"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "189ddd3b5d32a70b35e7686054371742a937b0d99128e76dde6340210e966669"
checksum = "e8ad8cef104ac57b68b89df3208164d228503abbdce70f6880ffa3d970e7443a"
dependencies = [
"os_str_bytes",
"cc",
]
[[package]]
@@ -354,13 +392,25 @@ dependencies = [
"memchr",
]
[[package]]
name = "comfy-table"
version = "5.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b103d85ca6e209388771bfb7aa6b68a7aeec4afbf6f0a0264bfbf50360e5212e"
dependencies = [
"crossterm",
"strum",
"strum_macros",
"unicode-width",
]
[[package]]
name = "compute_tools"
version = "0.1.0"
dependencies = [
"anyhow",
"chrono",
"clap 3.1.12",
"clap 3.0.14",
"env_logger",
"hyper",
"libc",
@@ -550,6 +600,31 @@ dependencies = [
"lazy_static",
]
[[package]]
name = "crossterm"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a2102ea4f781910f8a5b98dd061f4c2023f479ce7bb1236330099ceb5a93cf17"
dependencies = [
"bitflags",
"crossterm_winapi",
"libc",
"mio",
"parking_lot 0.12.0",
"signal-hook",
"signal-hook-mio",
"winapi",
]
[[package]]
name = "crossterm_winapi"
version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2ae1b35a484aa10e07fe0638d02301c5ad24de82d310ccbd2f3693da5f09bf1c"
dependencies = [
"winapi",
]
[[package]]
name = "crypto-common"
version = "0.1.3"
@@ -717,9 +792,9 @@ dependencies = [
[[package]]
name = "etcd-client"
version = "0.8.4"
version = "0.9.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "585de5039d1ecce74773db49ba4e8107e42be7c2cd0b1a9e7fce27181db7b118"
checksum = "c434d2800b273a506b82397aad2f20971636f65e47b27c027f77d498530c5954"
dependencies = [
"http",
"prost",
@@ -727,9 +802,26 @@ dependencies = [
"tokio-stream",
"tonic",
"tonic-build",
"tower",
"tower-service",
]
[[package]]
name = "etcd_broker"
version = "0.1.0"
dependencies = [
"etcd-client",
"regex",
"serde",
"serde_json",
"serde_with",
"thiserror",
"tokio",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "fail"
version = "0.5.0"
@@ -786,6 +878,18 @@ version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e"
[[package]]
name = "flate2"
version = "1.0.23"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b39522e96686d38f4bc984b9198e3a0613264abaebaff2c5c918bfa6b6da09af"
dependencies = [
"cfg-if",
"crc32fast",
"libc",
"miniz_oxide 0.5.1",
]
[[package]]
name = "fnv"
version = "1.0.7"
@@ -1085,6 +1189,12 @@ dependencies = [
"pin-project-lite",
]
[[package]]
name = "http-range-header"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0bfe8eed0a9285ef776bb792479ea3834e8b94e13d615c2f66d03dd50a435a29"
[[package]]
name = "httparse"
version = "1.6.0"
@@ -1103,6 +1213,16 @@ version = "2.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9a3a5bfb195931eeb336b2a7b4d761daec841b97f947d34394601737a7bba5e4"
[[package]]
name = "humantime-serde"
version = "1.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "57a3db5ea5923d99402c94e9feb261dc5ee9b4efa158b0315f788cf549cc200c"
dependencies = [
"humantime",
"serde",
]
[[package]]
name = "hyper"
version = "0.14.17"
@@ -1340,6 +1460,12 @@ version = "0.1.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a3e378b66a060d48947b590737b30a1be76706c8dd7b8ba0f2fe3989c68a853f"
[[package]]
name = "matchit"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73cbba799671b762df5a175adf59ce145165747bb891505c43d09aefbbf38beb"
[[package]]
name = "md-5"
version = "0.9.1"
@@ -1423,6 +1549,15 @@ dependencies = [
"autocfg",
]
[[package]]
name = "miniz_oxide"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2b29bd4bc3f33391105ebee3589c19197c4271e3e5a9ec9bfe8127eeff8f082"
dependencies = [
"adler",
]
[[package]]
name = "mio"
version = "0.8.2"
@@ -1470,6 +1605,24 @@ dependencies = [
"tempfile",
]
[[package]]
name = "neon_local"
version = "0.1.0"
dependencies = [
"anyhow",
"clap 3.0.14",
"comfy-table",
"control_plane",
"git-version",
"pageserver",
"postgres",
"postgres_ffi",
"safekeeper",
"serde_json",
"utils",
"workspace_hack",
]
[[package]]
name = "nix"
version = "0.23.1"
@@ -1633,26 +1786,32 @@ name = "os_str_bytes"
version = "6.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e22443d1643a904602595ba1cd8f7d896afe56d26712531c5ff73a15b2fbf64"
dependencies = [
"memchr",
]
[[package]]
name = "pageserver"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"byteorder",
"bytes",
"chrono",
"clap 3.1.12",
"clap 3.0.14",
"close_fds",
"const_format",
"crc32c",
"crossbeam-utils",
"daemonize",
"etcd_broker",
"fail",
"futures",
"git-version",
"hex",
"hex-literal",
"humantime",
"humantime-serde",
"hyper",
"itertools",
"lazy_static",
@@ -1666,8 +1825,7 @@ dependencies = [
"pprof",
"rand",
"regex",
"rusoto_core",
"rusoto_s3",
"remote_storage",
"scopeguard",
"serde",
"serde_json",
@@ -1679,11 +1837,11 @@ dependencies = [
"tokio",
"tokio-postgres",
"tokio-stream",
"tokio-util 0.7.0",
"toml_edit",
"tracing",
"url",
"utils",
"walkdir",
"workspace_hack",
]
@@ -1901,15 +2059,18 @@ dependencies = [
"bytes",
"chrono",
"crc32c",
"env_logger",
"hex",
"lazy_static",
"log",
"memoffset",
"postgres",
"rand",
"regex",
"serde",
"thiserror",
"utils",
"wal_generate",
"workspace_hack",
]
@@ -1939,27 +2100,13 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872"
[[package]]
name = "proc-macro-error"
version = "1.0.4"
name = "prettyplease"
version = "0.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "da25490ff9892aab3fcf7c36f08cfb902dd3e71ca0f9f9517bea02a73a5ce38c"
checksum = "d9e07e3a46d0771a8a06b5f4441527802830b43e679ba12f44960f48dd4c6803"
dependencies = [
"proc-macro-error-attr",
"proc-macro2",
"quote",
"syn",
"version_check",
]
[[package]]
name = "proc-macro-error-attr"
version = "1.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a1be40180e52ecc98ad80b184934baf3d0d29f979574e439af5a55274b35f869"
dependencies = [
"proc-macro2",
"quote",
"version_check",
]
[[package]]
@@ -1977,6 +2124,20 @@ dependencies = [
"unicode-xid",
]
[[package]]
name = "procfs"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "95e344cafeaeefe487300c361654bcfc85db3ac53619eeccced29f5ea18c4c70"
dependencies = [
"bitflags",
"byteorder",
"flate2",
"hex",
"lazy_static",
"libc",
]
[[package]]
name = "prometheus"
version = "0.13.0"
@@ -1986,16 +2147,18 @@ dependencies = [
"cfg-if",
"fnv",
"lazy_static",
"libc",
"memchr",
"parking_lot 0.11.2",
"procfs",
"thiserror",
]
[[package]]
name = "prost"
version = "0.9.0"
version = "0.10.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001"
checksum = "bc03e116981ff7d8da8e5c220e374587b98d294af7ba7dd7fda761158f00086f"
dependencies = [
"bytes",
"prost-derive",
@@ -2003,12 +2166,14 @@ dependencies = [
[[package]]
name = "prost-build"
version = "0.9.0"
version = "0.10.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5"
checksum = "65a1118354442de7feb8a2a76f3d80ef01426bd45542c8c1fdffca41a758f846"
dependencies = [
"bytes",
"heck 0.3.3",
"cfg-if",
"cmake",
"heck 0.4.0",
"itertools",
"lazy_static",
"log",
@@ -2023,9 +2188,9 @@ dependencies = [
[[package]]
name = "prost-derive"
version = "0.9.0"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe"
checksum = "7b670f45da57fb8542ebdbb6105a925fe571b67f9e7ed9f47a06a84e72b4e7cc"
dependencies = [
"anyhow",
"itertools",
@@ -2036,9 +2201,9 @@ dependencies = [
[[package]]
name = "prost-types"
version = "0.9.0"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a"
checksum = "2d0a014229361011dc8e69c8a1ec6c2e8d0f2af7c91e3ea3f5b2170298461e68"
dependencies = [
"bytes",
"prost",
@@ -2052,8 +2217,9 @@ dependencies = [
"async-trait",
"base64",
"bytes",
"clap 3.1.12",
"clap 3.0.14",
"futures",
"git-version",
"hashbrown",
"hex",
"hmac 0.12.1",
@@ -2080,6 +2246,7 @@ dependencies = [
"tokio-postgres",
"tokio-postgres-rustls",
"tokio-rustls",
"url",
"utils",
"workspace_hack",
]
@@ -2209,9 +2376,9 @@ dependencies = [
[[package]]
name = "regex"
version = "1.5.4"
version = "1.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d07a8629359eb56f1e2fb1652bb04212c072a87ba68546a04065d525673ac461"
checksum = "1a11647b6b25ff05a515cb92c365cec08801e83423a235b51e231e1808747286"
dependencies = [
"aho-corasick",
"memchr",
@@ -2233,6 +2400,23 @@ version = "0.6.25"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f497285884f3fcff424ffc933e56d7cbca511def0c9831a7f9b5f6153e3cc89b"
[[package]]
name = "remote_storage"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"rusoto_core",
"rusoto_s3",
"serde",
"serde_json",
"tempfile",
"tokio",
"tokio-util 0.7.0",
"tracing",
"workspace_hack",
]
[[package]]
name = "remove_dir_all"
version = "0.5.3"
@@ -2332,9 +2516,9 @@ dependencies = [
[[package]]
name = "rusoto_core"
version = "0.47.0"
version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5b4f000e8934c1b4f70adde180056812e7ea6b1a247952db8ee98c94cd3116cc"
checksum = "1db30db44ea73551326269adcf7a2169428a054f14faf9e1768f2163494f2fa2"
dependencies = [
"async-trait",
"base64",
@@ -2357,9 +2541,9 @@ dependencies = [
[[package]]
name = "rusoto_credential"
version = "0.47.0"
version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6a46b67db7bb66f5541e44db22b0a02fed59c9603e146db3a9e633272d3bac2f"
checksum = "ee0a6c13db5aad6047b6a44ef023dbbc21a056b6dab5be3b79ce4283d5c02d05"
dependencies = [
"async-trait",
"chrono",
@@ -2375,9 +2559,9 @@ dependencies = [
[[package]]
name = "rusoto_s3"
version = "0.47.0"
version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "048c2fe811a823ad5a9acc976e8bf4f1d910df719dcf44b15c3e96c5b7a51027"
checksum = "7aae4677183411f6b0b412d66194ef5403293917d66e70ab118f07cc24c5b14d"
dependencies = [
"async-trait",
"bytes",
@@ -2388,9 +2572,9 @@ dependencies = [
[[package]]
name = "rusoto_signature"
version = "0.47.0"
version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6264e93384b90a747758bcc82079711eacf2e755c3a8b5091687b5349d870bcc"
checksum = "a5ae95491c8b4847931e291b151127eccd6ff8ca13f33603eb3d0035ecb05272"
dependencies = [
"base64",
"bytes",
@@ -2482,12 +2666,13 @@ dependencies = [
"anyhow",
"byteorder",
"bytes",
"clap 3.1.12",
"clap 3.0.14",
"const_format",
"crc32c",
"daemonize",
"etcd-client",
"etcd_broker",
"fs2",
"git-version",
"hex",
"humantime",
"hyper",
@@ -2497,8 +2682,7 @@ dependencies = [
"postgres-protocol",
"postgres_ffi",
"regex",
"rusoto_core",
"rusoto_s3",
"remote_storage",
"serde",
"serde_json",
"serde_with",
@@ -2703,6 +2887,17 @@ dependencies = [
"signal-hook-registry",
]
[[package]]
name = "signal-hook-mio"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "29ad2e15f37ec9a6cc544097b78a1ec90001e9f71b81338ca39f430adaca99af"
dependencies = [
"libc",
"mio",
"signal-hook",
]
[[package]]
name = "signal-hook-registry"
version = "1.4.0"
@@ -2792,6 +2987,25 @@ version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623"
[[package]]
name = "strum"
version = "0.23.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cae14b91c7d11c9a851d3fbc80a963198998c2a64eec840477fa92d8ce9b70bb"
[[package]]
name = "strum_macros"
version = "0.23.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5bb0dc7ee9c15cea6199cde9a127fa16a4c5819af85395457ad72d68edc85a38"
dependencies = [
"heck 0.3.3",
"proc-macro2",
"quote",
"rustversion",
"syn",
]
[[package]]
name = "subtle"
version = "2.4.1"
@@ -2823,15 +3037,21 @@ dependencies = [
[[package]]
name = "syn"
version = "1.0.86"
version = "1.0.92"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8a65b3f4ffa0092e9887669db0eae07941f023991ab58ea44da8fe8e2d511c6b"
checksum = "7ff7c592601f11445996a06f8ad0c27f094a58857c2f89e97974ab9235b92c52"
dependencies = [
"proc-macro2",
"quote",
"unicode-xid",
]
[[package]]
name = "sync_wrapper"
version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "20518fe4a4c9acf048008599e464deb21beeae3d3578418951a189c235a7a9a8"
[[package]]
name = "tar"
version = "0.4.38"
@@ -2877,9 +3097,9 @@ dependencies = [
[[package]]
name = "textwrap"
version = "0.15.0"
version = "0.14.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b1141d4d61095b28419e22cb0bbf02755f5e54e0526f97f1e3d1d160e60885fb"
checksum = "0066c8d12af8b5acd21e00547c3797fde4e8677254a7ee429176ccebbe93dd80"
[[package]]
name = "thiserror"
@@ -3125,12 +3345,13 @@ dependencies = [
[[package]]
name = "tonic"
version = "0.6.2"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a"
checksum = "5be9d60db39854b30b835107500cf0aca0b0d14d6e1c3de124217c23a29c2ddb"
dependencies = [
"async-stream",
"async-trait",
"axum",
"base64",
"bytes",
"futures-core",
@@ -3146,7 +3367,7 @@ dependencies = [
"prost-derive",
"tokio",
"tokio-stream",
"tokio-util 0.6.9",
"tokio-util 0.7.0",
"tower",
"tower-layer",
"tower-service",
@@ -3156,10 +3377,11 @@ dependencies = [
[[package]]
name = "tonic-build"
version = "0.6.2"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9403f1bafde247186684b230dc6f38b5cd514584e8bec1dd32514be4745fa757"
checksum = "d9263bf4c9bfaae7317c1c2faf7f18491d2fe476f70c414b73bf5d445b00ffa1"
dependencies = [
"prettyplease",
"proc-macro2",
"prost-build",
"quote",
@@ -3186,6 +3408,25 @@ dependencies = [
"tracing",
]
[[package]]
name = "tower-http"
version = "0.3.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e980386f06883cf4d0578d6c9178c81f68b45d77d00f2c2c1bc034b3439c2c56"
dependencies = [
"bitflags",
"bytes",
"futures-core",
"futures-util",
"http",
"http-body",
"http-range-header",
"pin-project-lite",
"tower",
"tower-layer",
"tower-service",
]
[[package]]
name = "tower-layer"
version = "0.3.1"
@@ -3401,6 +3642,18 @@ version = "0.9.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f"
[[package]]
name = "wal_generate"
version = "0.1.0"
dependencies = [
"anyhow",
"clap 3.0.14",
"env_logger",
"log",
"postgres",
"tempfile",
]
[[package]]
name = "walkdir"
version = "2.3.2"
@@ -3627,13 +3880,22 @@ dependencies = [
name = "workspace_hack"
version = "0.1.0"
dependencies = [
"ahash",
"anyhow",
"bytes",
"chrono",
"clap 2.34.0",
"either",
"fail",
"futures-channel",
"futures-task",
"futures-util",
"generic-array",
"hashbrown",
"hex",
"hyper",
"indexmap",
"itoa 0.4.8",
"libc",
"log",
"memchr",
@@ -3647,6 +3909,7 @@ dependencies = [
"serde",
"syn",
"tokio",
"tokio-util 0.7.0",
"tracing",
"tracing-core",
]
@@ -3675,22 +3938,6 @@ dependencies = [
"chrono",
]
[[package]]
name = "zenith"
version = "0.1.0"
dependencies = [
"anyhow",
"clap 3.1.12",
"control_plane",
"pageserver",
"postgres",
"postgres_ffi",
"safekeeper",
"serde_json",
"utils",
"workspace_hack",
]
[[package]]
name = "zeroize"
version = "1.5.2"

View File

@@ -6,7 +6,7 @@ members = [
"proxy",
"safekeeper",
"workspace_hack",
"zenith",
"neon_local",
"libs/*",
]

View File

@@ -15,4 +15,4 @@ RUN set -e \
# Final image that only has one binary
FROM debian:buster-slim
COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl
COPY --from=rust-build /home/circleci/project/target/release/compute_ctl /usr/local/bin/compute_ctl

View File

@@ -12,15 +12,21 @@ endif
#
BUILD_TYPE ?= debug
ifeq ($(BUILD_TYPE),release)
PG_CONFIGURE_OPTS = --enable-debug
PG_CONFIGURE_OPTS = --enable-debug --with-openssl
PG_CFLAGS = -O2 -g3 $(CFLAGS)
# Unfortunately, `--profile=...` is a nightly feature
CARGO_BUILD_FLAGS += --release
else ifeq ($(BUILD_TYPE),debug)
PG_CONFIGURE_OPTS = --enable-debug --enable-cassert --enable-depend
PG_CONFIGURE_OPTS = --enable-debug --with-openssl --enable-cassert --enable-depend
PG_CFLAGS = -O0 -g3 $(CFLAGS)
else
$(error Bad build type `$(BUILD_TYPE)', see Makefile for options)
$(error Bad build type '$(BUILD_TYPE)', see Makefile for options)
endif
# macOS with brew-installed openssl requires explicit paths
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Darwin)
PG_CONFIGURE_OPTS += --with-includes=/usr/local/opt/openssl/include --with-libraries=/usr/local/opt/openssl/lib
endif
# Choose whether we should be silent or verbose

5
NOTICE Normal file
View File

@@ -0,0 +1,5 @@
Neon
Copyright 2022 Neon Inc.
The PostgreSQL submodule in vendor/postgres is licensed under the
PostgreSQL license. See vendor/postgres/COPYRIGHT.

126
README.md
View File

@@ -23,61 +23,100 @@ Pageserver consists of:
## Running local installation
#### building on Ubuntu/ Debian (Linux)
1. Install build dependencies and other useful packages
On Ubuntu or Debian this set of packages should be sufficient to build the code:
```text
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev
libssl-dev clang pkg-config libpq-dev libprotobuf-dev etcd
```
[Rust] 1.58 or later is also required.
2. [Install Rust](https://www.rust-lang.org/tools/install)
```
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
3. Install PostgreSQL Client
```
apt install postgresql-client
```
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build neon and patched postgres
4. Build neon and patched postgres
```sh
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
3. Start pageserver and postgres on top of it (should be called from repo root):
#### building on OSX (12.3.1)
1. Install XCode and dependencies
```
xcode-select --install
brew install protobuf etcd
```
2. [Install Rust](https://www.rust-lang.org/tools/install)
```
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
3. Install PostgreSQL Client
```
# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq
```
4. Build neon and patched postgres
```sh
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
#### dependency installation notes
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
#### running neon database
1. Start pageserver and postgres on top of it (should be called from repo root):
```sh
# Create repository in .zenith with proper paths to binaries and data
# Later that would be responsibility of a package install script
> ./target/debug/zenith init
initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229
created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8
created main branch
> ./target/debug/neon_local init
initializing tenantid 9ef87a5bf0d92544f6fafeeb3239695c
created initial timeline de200bd42b49cc1814412c7e592dd6e9 timeline.lsn 0/16B5A50
initial timeline de200bd42b49cc1814412c7e592dd6e9 created
pageserver init succeeded
# start pageserver and safekeeper
> ./target/debug/zenith start
Starting pageserver at 'localhost:64000' in '.zenith'
> ./target/debug/neon_local start
Starting pageserver at '127.0.0.1:64000' in '.zenith'
Pageserver started
initializing for single for 7676
Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/single'
initializing for sk 1 for 7676
Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/sk1'
Safekeeper started
# start postgres compute node
> ./target/debug/zenith pg start main
Starting new postgres main on timeline 5b014a9e41b4b63ce1a1febc04503636 ...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432
> ./target/debug/neon_local pg start main
Starting new postgres main on timeline de200bd42b49cc1814412c7e592dd6e9 ...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres'
waiting for server to start.... done
server started
# check list of running postgres instances
> ./target/debug/zenith pg list
NODE ADDRESS TIMELINES BRANCH NAME LSN STATUS
main 127.0.0.1:55432 5b014a9e41b4b63ce1a1febc04503636 main 0/1609610 running
> ./target/debug/neon_local pg list
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16B5BA8 running
```
4. Now it is possible to connect to postgres and run some queries:
2. Now it is possible to connect to postgres and run some queries:
```text
> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
@@ -91,21 +130,28 @@ postgres=# select * from t;
(1 row)
```
5. And create branches and run postgres on them:
3. And create branches and run postgres on them:
```sh
# create branch named migration_check
> ./target/debug/zenith timeline branch --branch-name migration_check
Created timeline '0e9331cad6efbafe6a88dd73ae21a5c9' at Lsn 0/16F5830 for tenant: c03ba6b7ad4c5e9cf556f059ade44229. Ancestor timeline: 'main'
> ./target/debug/neon_local timeline branch --branch-name migration_check
Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. Ancestor timeline: 'main'
# check branches tree
> ./target/debug/zenith timeline list
main [5b014a9e41b4b63ce1a1febc04503636]
┗━ @0/1609610: migration_check [0e9331cad6efbafe6a88dd73ae21a5c9]
> ./target/debug/neon_local timeline list
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]
# start postgres on that branch
> ./target/debug/zenith pg start migration_check
Starting postgres node at 'host=127.0.0.1 port=55433 user=stas'
waiting for server to start.... done
> ./target/debug/neon_local pg start migration_check --branch-name migration_check
Starting new postgres migration_check on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/migration_check port=55433
Starting postgres node at 'host=127.0.0.1 port=55433 user=zenith_admin dbname=postgres'
# check the new list of running postgres instances
> ./target/debug/neon_local pg list
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16F9A38 running
migration_check 127.0.0.1:55433 b3b863fa45fa9e57e615f9f2d944e601 migration_check 0/16F9A70 running
# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
@@ -118,12 +164,20 @@ postgres=# select * from t;
postgres=# insert into t values(2,2);
INSERT 0 1
# check that the new change doesn't affect the 'main' postgres
> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
postgres=# select * from t;
key | value
-----+-------
1 | 1
(1 row)
```
6. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances
4. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances
you have just started. You can stop them all with one command:
```sh
> ./target/debug/zenith stop
> ./target/debug/neon_local stop
```
## Running tests

View File

@@ -1,9 +1,9 @@
# Compute node tools
Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
`ExecStart` option. It will handle all the `zenith` specifics during compute node
Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
`ExecStart` option. It will handle all the `Neon` specifics during compute node
initialization:
- `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
- `compute_ctl` accepts cluster (compute node) specification as a JSON file.
- Every start is a fresh start, so the data directory is removed and
initialized again on each run.
- Next it will put configuration files into the `PGDATA` directory.
@@ -13,18 +13,18 @@ initialization:
- Check and alter/drop/create roles and databases.
- Hang waiting on the `postmaster` process to exit.
Also `zenith_ctl` spawns two separate service threads:
Also `compute_ctl` spawns two separate service threads:
- `compute-monitor` checks the last Postgres activity timestamp and saves it
into the shared `ComputeState`;
into the shared `ComputeNode`;
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
Usage example:
```sh
zenith_ctl -D /var/db/postgres/compute \
-C 'postgresql://zenith_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
compute_ctl -D /var/db/postgres/compute \
-C 'postgresql://zenith_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
```
## Tests

View File

@@ -0,0 +1,174 @@
//!
//! Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
//! `ExecStart` option. It will handle all the `Neon` specifics during compute node
//! initialization:
//! - `compute_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
//! - Try to start `postgres` and wait until it is ready to accept connections.
//! - Check and alter/drop/create roles and databases.
//! - Hang waiting on the `postmaster` process to exit.
//!
//! Also `compute_ctl` spawns two separate service threads:
//! - `compute-monitor` checks the last Postgres activity timestamp and saves it
//! into the shared `ComputeNode`;
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! Usage example:
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://zenith_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! ```
//!
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::exit;
use std::sync::{Arc, RwLock};
use std::{thread, time::Duration};
use anyhow::Result;
use chrono::Utc;
use clap::Arg;
use log::{error, info};
use compute_tools::compute::{ComputeMetrics, ComputeNode, ComputeState, ComputeStatus};
use compute_tools::http::api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::pg_helpers::*;
use compute_tools::spec::*;
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`
let version: Option<&str> = option_env!("CARGO_PKG_VERSION");
let matches = clap::App::new("compute_ctl")
.version(version.unwrap_or("unknown"))
.arg(
Arg::new("connstr")
.short('C')
.long("connstr")
.value_name("DATABASE_URL")
.required(true),
)
.arg(
Arg::new("pgdata")
.short('D')
.long("pgdata")
.value_name("DATADIR")
.required(true),
)
.arg(
Arg::new("pgbin")
.short('b')
.long("pgbin")
.value_name("POSTGRES_PATH"),
)
.arg(
Arg::new("spec")
.short('s')
.long("spec")
.value_name("SPEC_JSON"),
)
.arg(
Arg::new("spec-path")
.short('S')
.long("spec-path")
.value_name("SPEC_PATH"),
)
.get_matches();
let pgdata = matches.value_of("pgdata").expect("PGDATA path is required");
let connstr = matches
.value_of("connstr")
.expect("Postgres connection string is required");
let spec = matches.value_of("spec");
let spec_path = matches.value_of("spec-path");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.value_of("pgbin").unwrap_or("postgres");
let spec: ComputeSpec = match spec {
// First, try to get cluster spec from the cli argument
Some(json) => serde_json::from_str(json)?,
None => {
// Second, try to read it from the file if path is provided
if let Some(sp) = spec_path {
let path = Path::new(sp);
let file = File::open(path)?;
serde_json::from_reader(file)?
} else {
panic!("cluster spec should be provided via --spec or --spec-path argument");
}
}
};
let pageserver_connstr = spec
.cluster
.settings
.find("zenith.page_server_connstring")
.expect("pageserver connstr should be provided");
let tenant = spec
.cluster
.settings
.find("zenith.zenith_tenant")
.expect("tenant id should be provided");
let timeline = spec
.cluster
.settings
.find("zenith.zenith_timeline")
.expect("tenant id should be provided");
let compute_state = ComputeNode {
start_time: Utc::now(),
connstr: connstr.to_string(),
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
spec,
tenant,
timeline,
pageserver_connstr,
metrics: ComputeMetrics::new(),
state: RwLock::new(ComputeState::new()),
};
let compute = Arc::new(compute_state);
// Launch service threads first, so we were able to serve availability
// requests, while configuration is still in progress.
let _http_handle = launch_http_server(&compute).expect("cannot launch http endpoint thread");
let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread");
// Run compute (Postgres) and hang waiting on it.
match compute.prepare_and_run() {
Ok(ec) => {
let code = ec.code().unwrap_or(1);
info!("Postgres exited with code {}, shutting down", code);
exit(code)
}
Err(error) => {
error!("could not start the compute node: {}", error);
let mut state = compute.state.write().unwrap();
state.error = Some(format!("{:?}", error));
state.status = ComputeStatus::Failed;
drop(state);
// Keep serving HTTP requests, so the cloud control plane was able to
// get the actual error.
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
Err(error)
}
}
}

View File

@@ -1,252 +0,0 @@
//!
//! Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
//! `ExecStart` option. It will handle all the `zenith` specifics during compute node
//! initialization:
//! - `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
//! - Try to start `postgres` and wait until it is ready to accept connections.
//! - Check and alter/drop/create roles and databases.
//! - Hang waiting on the `postmaster` process to exit.
//!
//! Also `zenith_ctl` spawns two separate service threads:
//! - `compute-monitor` checks the last Postgres activity timestamp and saves it
//! into the shared `ComputeState`;
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! Usage example:
//! ```sh
//! zenith_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://zenith_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! ```
//!
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::{exit, Command, ExitStatus};
use std::sync::{Arc, RwLock};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use log::info;
use postgres::{Client, NoTls};
use compute_tools::checker::create_writablity_check_data;
use compute_tools::config;
use compute_tools::http_api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::pg_helpers::*;
use compute_tools::spec::*;
use compute_tools::zenith::*;
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
fn prepare_pgdata(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let state = state.read().unwrap();
let spec = &state.spec;
let pgdata_path = Path::new(&state.pgdata);
let pageserver_connstr = spec
.cluster
.settings
.find("zenith.page_server_connstring")
.expect("pageserver connstr should be provided");
let tenant = spec
.cluster
.settings
.find("zenith.zenith_tenant")
.expect("tenant id should be provided");
let timeline = spec
.cluster
.settings
.find("zenith.zenith_timeline")
.expect("tenant id should be provided");
info!(
"starting cluster #{}, operation #{}",
spec.cluster.cluster_id,
spec.operation_uuid.as_ref().unwrap()
);
// Remove/create an empty pgdata directory and put configuration there.
create_pgdata(&state.pgdata)?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
info!("starting safekeepers syncing");
let lsn = sync_safekeepers(&state.pgdata, &state.pgbin)
.with_context(|| "failed to sync safekeepers")?;
info!("safekeepers synced at LSN {}", lsn);
info!(
"getting basebackup@{} from pageserver {}",
lsn, pageserver_connstr
);
get_basebackup(&state.pgdata, &pageserver_connstr, &tenant, &timeline, &lsn).with_context(
|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, pageserver_connstr
)
},
)?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
let read_state = state.read().unwrap();
let pgdata_path = Path::new(&read_state.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&read_state.pgbin)
.args(&["-D", &read_state.pgdata])
.spawn()
.expect("cannot start postgres process");
// Try default Postgres port if it is not provided
let port = read_state
.spec
.cluster
.settings
.find("port")
.unwrap_or_else(|| "5432".to_string());
wait_for_postgres(&port, pgdata_path)?;
let mut client = Client::connect(&read_state.connstr, NoTls)?;
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
handle_grants(&read_state.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);
info!(
"finished configuration of cluster #{}",
read_state.spec.cluster.cluster_id
);
// Release the read lock.
drop(read_state);
// Get the write lock, update state and release the lock, so HTTP API
// was able to serve requests, while we are blocked waiting on
// Postgres.
let mut state = state.write().unwrap();
state.ready = true;
drop(state);
// Wait for child postgres process basically forever. In this state Ctrl+C
// will be propagated to postgres and it will be shut down as well.
let ecode = pg.wait().expect("failed to wait on postgres");
Ok(ecode)
}
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`
let version: Option<&str> = option_env!("CARGO_PKG_VERSION");
let matches = clap::Command::new("zenith_ctl")
.version(version.unwrap_or("unknown"))
.arg(
Arg::new("connstr")
.short('C')
.long("connstr")
.value_name("DATABASE_URL")
.required(true),
)
.arg(
Arg::new("pgdata")
.short('D')
.long("pgdata")
.value_name("DATADIR")
.required(true),
)
.arg(
Arg::new("pgbin")
.short('b')
.long("pgbin")
.value_name("POSTGRES_PATH"),
)
.arg(
Arg::new("spec")
.short('s')
.long("spec")
.value_name("SPEC_JSON"),
)
.arg(
Arg::new("spec-path")
.short('S')
.long("spec-path")
.value_name("SPEC_PATH"),
)
.get_matches();
let pgdata = matches.value_of("pgdata").expect("PGDATA path is required");
let connstr = matches
.value_of("connstr")
.expect("Postgres connection string is required");
let spec = matches.value_of("spec");
let spec_path = matches.value_of("spec-path");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.value_of("pgbin").unwrap_or("postgres");
let spec: ClusterSpec = match spec {
// First, try to get cluster spec from the cli argument
Some(json) => serde_json::from_str(json)?,
None => {
// Second, try to read it from the file if path is provided
if let Some(sp) = spec_path {
let path = Path::new(sp);
let file = File::open(path)?;
serde_json::from_reader(file)?
} else {
panic!("cluster spec should be provided via --spec or --spec-path argument");
}
}
};
let compute_state = ComputeState {
connstr: connstr.to_string(),
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
spec,
ready: false,
last_active: Utc::now(),
};
let compute_state = Arc::new(RwLock::new(compute_state));
// Launch service threads first, so we were able to serve availability
// requests, while configuration is still in progress.
let mut _threads = vec![
launch_http_server(&compute_state).expect("cannot launch compute monitor thread"),
launch_monitor(&compute_state).expect("cannot launch http endpoint thread"),
];
prepare_pgdata(&compute_state)?;
// Run compute (Postgres) and hang waiting on it. Panic if any error happens,
// it will help us to trigger unwind and kill postmaster as well.
match run_compute(&compute_state) {
Ok(ec) => exit(ec.success() as i32),
Err(error) => panic!("cannot start compute node, error: {}", error),
}
}

View File

@@ -1,11 +1,11 @@
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use crate::zenith::ComputeState;
use crate::compute::ComputeNode;
pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
let query = "
@@ -23,9 +23,9 @@ pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
Ok(())
}
pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let connstr = state.read().unwrap().connstr.clone();
let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?;
pub async fn check_writability(compute: &Arc<ComputeNode>) -> Result<()> {
let connstr = &compute.connstr;
let (client, connection) = tokio_postgres::connect(connstr, NoTls).await?;
if client.is_closed() {
return Err(anyhow!("connection to postgres closed"));
}

View File

@@ -0,0 +1,315 @@
//
// XXX: This starts to be scarry similar to the `PostgresNode` from `control_plane`,
// but there are several things that makes `PostgresNode` usage inconvenient in the
// cloud:
// - it inherits from `LocalEnv`, which contains **all-all** the information about
// a complete service running
// - it uses `PageServerNode` with information about http endpoint, which we do not
// need in the cloud again
// - many tiny pieces like, for example, we do not use `pg_ctl` in the cloud
//
// Thus, to use `PostgresNode` in the cloud, we need to 'mock' a bunch of required
// attributes (not required for the cloud). Yet, it is still tempting to unify these
// `PostgresNode` and `ComputeNode` and use one in both places.
//
// TODO: stabilize `ComputeNode` and think about using it in the `control_plane`.
//
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use log::info;
use postgres::{Client, NoTls};
use serde::{Serialize, Serializer};
use crate::checker::create_writablity_check_data;
use crate::config;
use crate::pg_helpers::*;
use crate::spec::*;
/// Compute node info shared across several `compute_ctl` threads.
pub struct ComputeNode {
pub start_time: DateTime<Utc>,
pub connstr: String,
pub pgdata: String,
pub pgbin: String,
pub spec: ComputeSpec,
pub tenant: String,
pub timeline: String,
pub pageserver_connstr: String,
pub metrics: ComputeMetrics,
/// Volatile part of the `ComputeNode` so should be used under `RwLock`
/// to allow HTTP API server to serve status requests, while configuration
/// is in progress.
pub state: RwLock<ComputeState>,
}
fn rfc3339_serialize<S>(x: &DateTime<Utc>, s: S) -> Result<S::Ok, S::Error>
where
S: Serializer,
{
x.to_rfc3339().serialize(s)
}
#[derive(Serialize)]
#[serde(rename_all = "snake_case")]
pub struct ComputeState {
pub status: ComputeStatus,
/// Timestamp of the last Postgres activity
#[serde(serialize_with = "rfc3339_serialize")]
pub last_active: DateTime<Utc>,
pub error: Option<String>,
}
impl ComputeState {
pub fn new() -> Self {
Self {
status: ComputeStatus::Init,
last_active: Utc::now(),
error: None,
}
}
}
impl Default for ComputeState {
fn default() -> Self {
Self::new()
}
}
#[derive(Serialize, Clone, Copy, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ComputeStatus {
Init,
Running,
Failed,
}
#[derive(Serialize)]
pub struct ComputeMetrics {
pub sync_safekeepers_ms: AtomicU64,
pub basebackup_ms: AtomicU64,
pub config_ms: AtomicU64,
pub total_startup_ms: AtomicU64,
}
impl ComputeMetrics {
pub fn new() -> Self {
Self {
sync_safekeepers_ms: AtomicU64::new(0),
basebackup_ms: AtomicU64::new(0),
config_ms: AtomicU64::new(0),
total_startup_ms: AtomicU64::new(0),
}
}
}
impl Default for ComputeMetrics {
fn default() -> Self {
Self::new()
}
}
impl ComputeNode {
pub fn set_status(&self, status: ComputeStatus) {
self.state.write().unwrap().status = status;
}
pub fn get_status(&self) -> ComputeStatus {
self.state.read().unwrap().status
}
// Remove `pgdata` directory and create it again with right permissions.
fn create_pgdata(&self) -> Result<()> {
// Ignore removal error, likely it is a 'No such file or directory (os error 2)'.
// If it is something different then create_dir() will error out anyway.
let _ok = fs::remove_dir_all(&self.pgdata);
fs::create_dir(&self.pgdata)?;
fs::set_permissions(&self.pgdata, fs::Permissions::from_mode(0o700))?;
Ok(())
}
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
fn get_basebackup(&self, lsn: &str) -> Result<()> {
let start_time = Utc::now();
let mut client = Client::connect(&self.pageserver_connstr, NoTls)?;
let basebackup_cmd = match lsn {
"0/0" => format!("basebackup {} {}", &self.tenant, &self.timeline), // First start of the compute
_ => format!("basebackup {} {} {}", &self.tenant, &self.timeline, lsn),
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut ar = tar::Archive::new(copyreader);
ar.unpack(&self.pgdata)?;
self.metrics.basebackup_ms.store(
Utc::now()
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
Ok(())
}
// Run `postgres` in a special mode with `--sync-safekeepers` argument
// and return the reported LSN back to the caller.
fn sync_safekeepers(&self) -> Result<String> {
let start_time = Utc::now();
let sync_handle = Command::new(&self.pgbin)
.args(&["--sync-safekeepers"])
.env("PGDATA", &self.pgdata) // we cannot use -D in this mode
.stdout(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"postgres --sync-safekeepers exited with non-zero status: {}",
sync_output.status,
);
}
self.metrics.sync_safekeepers_ms.store(
Utc::now()
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim());
Ok(lsn)
}
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
pub fn prepare_pgdata(&self) -> Result<()> {
let spec = &self.spec;
let pgdata_path = Path::new(&self.pgdata);
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
info!("starting safekeepers syncing");
let lsn = self
.sync_safekeepers()
.with_context(|| "failed to sync safekeepers")?;
info!("safekeepers synced at LSN {}", lsn);
info!(
"getting basebackup@{} from pageserver {}",
lsn, &self.pageserver_connstr
);
self.get_basebackup(&lsn).with_context(|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, &self.pageserver_connstr
)
})?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
pub fn run(&self) -> Result<ExitStatus> {
let start_time = Utc::now();
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&self.pgbin)
.args(&["-D", &self.pgdata])
.spawn()
.expect("cannot start postgres process");
// Try default Postgres port if it is not provided
let port = self
.spec
.cluster
.settings
.find("port")
.unwrap_or_else(|| "5432".to_string());
wait_for_postgres(&mut pg, &port, pgdata_path)?;
let mut client = Client::connect(&self.connstr, NoTls)?;
handle_roles(&self.spec, &mut client)?;
handle_databases(&self.spec, &mut client)?;
handle_grants(&self.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);
let startup_end_time = Utc::now();
self.metrics.config_ms.store(
startup_end_time
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
self.metrics.total_startup_ms.store(
startup_end_time
.signed_duration_since(self.start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
self.set_status(ComputeStatus::Running);
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
// Wait for child Postgres process basically forever. In this state Ctrl+C
// will propagate to Postgres and it will be shut down as well.
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
Ok(ecode)
}
pub fn prepare_and_run(&self) -> Result<ExitStatus> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
self.run()
}
}

View File

@@ -6,7 +6,7 @@ use std::path::Path;
use anyhow::Result;
use crate::pg_helpers::PgOptionsSerialize;
use crate::zenith::ClusterSpec;
use crate::spec::ComputeSpec;
/// Check that `line` is inside a text file and put it there if it is not.
/// Create file if it doesn't exist.
@@ -32,20 +32,20 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
pub fn write_postgres_conf(path: &Path, spec: &ClusterSpec) -> Result<()> {
pub fn write_postgres_conf(path: &Path, spec: &ComputeSpec) -> Result<()> {
// File::create() destroys the file content if it exists.
let mut postgres_conf = File::create(path)?;
write_zenith_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?;
write_auto_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?;
Ok(())
}
// Write Postgres config block wrapped with generated comment section
fn write_zenith_managed_block(file: &mut File, buf: &str) -> Result<()> {
writeln!(file, "# Managed by Zenith: begin")?;
fn write_auto_managed_block(file: &mut File, buf: &str) -> Result<()> {
writeln!(file, "# Managed by compute_ctl: begin")?;
writeln!(file, "{}", buf)?;
writeln!(file, "# Managed by Zenith: end")?;
writeln!(file, "# Managed by compute_ctl: end")?;
Ok(())
}

View File

@@ -1,37 +1,64 @@
use std::convert::Infallible;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use std::thread;
use anyhow::Result;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use log::{error, info};
use serde_json;
use crate::zenith::*;
use crate::compute::{ComputeNode, ComputeStatus};
// Service function to handle all available routes.
async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
async fn routes(req: Request<Body>, compute: Arc<ComputeNode>) -> Response<Body> {
match (req.method(), req.uri().path()) {
// Timestamp of the last Postgres activity in the plain text.
// DEPRECATED in favour of /status
(&Method::GET, "/last_activity") => {
info!("serving /last_active GET request");
let state = state.read().unwrap();
let state = compute.state.read().unwrap();
// Use RFC3339 format for consistency.
Response::new(Body::from(state.last_active.to_rfc3339()))
}
// Has compute setup process finished? -> true/false
// Has compute setup process finished? -> true/false.
// DEPRECATED in favour of /status
(&Method::GET, "/ready") => {
info!("serving /ready GET request");
let state = state.read().unwrap();
Response::new(Body::from(format!("{}", state.ready)))
let status = compute.get_status();
Response::new(Body::from(format!("{}", status == ComputeStatus::Running)))
}
// Serialized compute state.
(&Method::GET, "/status") => {
info!("serving /status GET request");
let state = compute.state.read().unwrap();
Response::new(Body::from(serde_json::to_string(&*state).unwrap()))
}
// Startup metrics in JSON format. Keep /metrics reserved for a possible
// future use for Prometheus metrics format.
(&Method::GET, "/metrics.json") => {
info!("serving /metrics.json GET request");
Response::new(Body::from(serde_json::to_string(&compute.metrics).unwrap()))
}
// DEPRECATED, use POST instead
(&Method::GET, "/check_writability") => {
info!("serving /check_writability GET request");
let res = crate::checker::check_writability(&state).await;
let res = crate::checker::check_writability(&compute).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
}
}
(&Method::POST, "/check_writability") => {
info!("serving /check_writability POST request");
let res = crate::checker::check_writability(&compute).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
@@ -49,7 +76,7 @@ async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Respons
// Main Hyper HTTP server function that runs it and blocks waiting on it forever.
#[tokio::main]
async fn serve(state: Arc<RwLock<ComputeState>>) {
async fn serve(state: Arc<ComputeNode>) {
let addr = SocketAddr::from(([0, 0, 0, 0], 3080));
let make_service = make_service_fn(move |_conn| {
@@ -73,7 +100,7 @@ async fn serve(state: Arc<RwLock<ComputeState>>) {
}
/// Launch a separate Hyper HTTP API server thread and return its `JoinHandle`.
pub fn launch_http_server(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
pub fn launch_http_server(state: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()

View File

@@ -0,0 +1 @@
pub mod api;

View File

@@ -0,0 +1,158 @@
openapi: "3.0.2"
info:
title: Compute node control API
version: "1.0"
servers:
- url: "http://localhost:3080"
paths:
/status:
get:
tags:
- "info"
summary: Get compute node internal status
description: ""
operationId: getComputeStatus
responses:
"200":
description: ComputeState
content:
application/json:
schema:
$ref: "#/components/schemas/ComputeState"
/metrics.json:
get:
tags:
- "info"
summary: Get compute node startup metrics in JSON format
description: ""
operationId: getComputeMetricsJSON
responses:
"200":
description: ComputeMetrics
content:
application/json:
schema:
$ref: "#/components/schemas/ComputeMetrics"
/ready:
get:
deprecated: true
tags:
- "info"
summary: Check whether compute startup process finished successfully
description: ""
operationId: computeIsReady
responses:
"200":
description: Compute is ready ('true') or not ('false')
content:
text/plain:
schema:
type: string
example: "true"
/last_activity:
get:
deprecated: true
tags:
- "info"
summary: Get timestamp of the last compute activity
description: ""
operationId: getLastComputeActivityTS
responses:
"200":
description: Timestamp of the last compute activity
content:
text/plain:
schema:
type: string
example: "2022-10-12T07:20:50.52Z"
/check_writability:
get:
deprecated: true
tags:
- "check"
summary: Check that we can write new data on this compute
description: ""
operationId: checkComputeWritabilityDeprecated
responses:
"200":
description: Check result
content:
text/plain:
schema:
type: string
description: Error text or 'true' if check passed
example: "true"
post:
tags:
- "check"
summary: Check that we can write new data on this compute
description: ""
operationId: checkComputeWritability
responses:
"200":
description: Check result
content:
text/plain:
schema:
type: string
description: Error text or 'true' if check passed
example: "true"
components:
securitySchemes:
JWT:
type: http
scheme: bearer
bearerFormat: JWT
schemas:
ComputeMetrics:
type: object
description: Compute startup metrics
required:
- sync_safekeepers_ms
- basebackup_ms
- config_ms
- total_startup_ms
properties:
sync_safekeepers_ms:
type: integer
basebackup_ms:
type: integer
config_ms:
type: integer
total_startup_ms:
type: integer
ComputeState:
type: object
required:
- status
- last_active
properties:
status:
$ref: '#/components/schemas/ComputeStatus'
last_active:
type: string
description: The last detected compute activity timestamp in UTC and RFC3339 format
example: "2022-10-12T07:20:50.52Z"
error:
type: string
description: Text of the error during compute startup, if any
ComputeStatus:
type: string
enum:
- init
- failed
- running
security:
- JWT: []

View File

@@ -4,11 +4,12 @@
//!
pub mod checker;
pub mod config;
pub mod http_api;
pub mod http;
#[macro_use]
pub mod logger;
pub mod compute;
pub mod monitor;
pub mod params;
#[allow(clippy::format_push_string)] // Clippy's suggestion doesn't actually work
pub mod pg_helpers;
pub mod spec;
pub mod zenith;

View File

@@ -1,4 +1,4 @@
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use std::{thread, time};
use anyhow::Result;
@@ -6,16 +6,16 @@ use chrono::{DateTime, Utc};
use log::{debug, info};
use postgres::{Client, NoTls};
use crate::zenith::ComputeState;
use crate::compute::ComputeNode;
const MONITOR_CHECK_INTERVAL: u64 = 500; // milliseconds
// Spin in a loop and figure out the last activity time in the Postgres.
// Then update it in the shared state. This function never errors out.
// XXX: the only expected panic is at `RwLock` unwrap().
fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
fn watch_compute_activity(compute: &Arc<ComputeNode>) {
// Suppose that `connstr` doesn't change
let connstr = state.read().unwrap().connstr.clone();
let connstr = compute.connstr.clone();
// Define `client` outside of the loop to reuse existing connection if it's active.
let mut client = Client::connect(&connstr, NoTls);
let timeout = time::Duration::from_millis(MONITOR_CHECK_INTERVAL);
@@ -46,7 +46,7 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
AND usename != 'zenith_admin';", // XXX: find a better way to filter other monitors?
&[],
);
let mut last_active = state.read().unwrap().last_active;
let mut last_active = compute.state.read().unwrap().last_active;
if let Ok(backs) = backends {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
@@ -83,14 +83,14 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
}
// Update the last activity in the shared state if we got a more recent one.
let mut state = state.write().unwrap();
let mut state = compute.state.write().unwrap();
if last_active > state.last_active {
state.last_active = last_active;
debug!("set the last compute activity time to: {}", last_active);
}
}
Err(e) => {
info!("cannot connect to postgres: {}, retrying", e);
debug!("cannot connect to postgres: {}, retrying", e);
// Establish a new connection and try again.
client = Client::connect(&connstr, NoTls);
@@ -100,7 +100,7 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
}
/// Launch a separate compute monitor thread and return its `JoinHandle`.
pub fn launch_monitor(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
pub fn launch_monitor(state: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()

View File

@@ -1,7 +1,9 @@
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::net::{SocketAddr, TcpStream};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Command;
use std::process::Child;
use std::str::FromStr;
use std::{fs, thread, time};
@@ -220,12 +222,12 @@ pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
/// Wait for Postgres to become ready to accept connections:
/// - state should be `ready` in the `pgdata/postmaster.pid`
/// - and we should be able to connect to 127.0.0.1:5432
pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> {
pub fn wait_for_postgres(pg: &mut Child, port: &str, pgdata: &Path) -> Result<()> {
let pid_path = pgdata.join("postmaster.pid");
let mut slept: u64 = 0; // ms
let pause = time::Duration::from_millis(100);
let timeout = time::Duration::from_millis(200);
let timeout = time::Duration::from_millis(10);
let addr = SocketAddr::from_str(&format!("127.0.0.1:{}", port)).unwrap();
loop {
@@ -236,14 +238,19 @@ pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> {
bail!("timed out while waiting for Postgres to start");
}
if let Ok(Some(status)) = pg.try_wait() {
// Postgres exited, that is not what we expected, bail out earlier.
let code = status.code().unwrap_or(-1);
bail!("Postgres exited unexpectedly with code {}", code);
}
if pid_path.exists() {
// XXX: dumb and the simplest way to get the last line in a text file
// TODO: better use `.lines().last()` later
let stdout = Command::new("tail")
.args(&["-n1", pid_path.to_str().unwrap()])
.output()?
.stdout;
let status = String::from_utf8(stdout)?;
let file = BufReader::new(File::open(&pid_path)?);
let status = file
.lines()
.last()
.unwrap()
.unwrap_or_else(|_| "unknown".to_string());
let can_connect = TcpStream::connect_timeout(&addr, timeout).is_ok();
// Now Postgres is ready to accept connections

View File

@@ -3,16 +3,53 @@ use std::path::Path;
use anyhow::Result;
use log::{info, log_enabled, warn, Level};
use postgres::Client;
use serde::Deserialize;
use crate::config;
use crate::params::PG_HBA_ALL_MD5;
use crate::pg_helpers::*;
use crate::zenith::ClusterSpec;
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[derive(Clone, Deserialize)]
pub struct ComputeSpec {
pub format_version: f32,
pub timestamp: String,
pub operation_uuid: Option<String>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
}
/// Cluster state seen from the perspective of the external tools
/// like Rails web console.
#[derive(Clone, Deserialize)]
pub struct Cluster {
pub cluster_id: String,
pub name: String,
pub state: Option<String>,
pub roles: Vec<Role>,
pub databases: Vec<Database>,
pub settings: GenericOptions,
}
/// Single cluster state changing operation that could not be represented as
/// a static `Cluster` structure. For example:
/// - DROP DATABASE
/// - DROP ROLE
/// - ALTER ROLE name RENAME TO new_name
/// - ALTER DATABASE name RENAME TO new_name
#[derive(Clone, Deserialize)]
pub struct DeltaOp {
pub action: String,
pub name: PgIdent,
pub new_name: Option<PgIdent>,
}
/// It takes cluster specification and does the following:
/// - Serialize cluster config and put it into `postgresql.conf` completely rewriting the file.
/// - Update `pg_hba.conf` to allow external connections.
pub fn handle_configuration(spec: &ClusterSpec, pgdata_path: &Path) -> Result<()> {
pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included into `basebackup`, so just
// always write all config into it creating new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
@@ -39,7 +76,7 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
@@ -136,13 +173,20 @@ pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
xact.execute(query.as_str(), &[])?;
}
} else {
info!("role name {}", &name);
info!("role name: '{}'", &name);
let mut query: String = format!("CREATE ROLE {} ", name.quote());
info!("role create query {}", &query);
info!("role create query: '{}'", &query);
info_print!(" -> create");
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
let grant_query = format!(
"grant pg_read_all_data, pg_write_all_data to {}",
name.quote()
);
xact.execute(grant_query.as_str(), &[])?;
info!("role grant query: '{}'", &grant_query);
}
info_print!("\n");
@@ -158,7 +202,7 @@ pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
/// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level
/// atomicity should be enough here due to the order of operations and various checks,
/// which together provide us idempotency.
pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
@@ -247,7 +291,7 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
// Grant CREATE ON DATABASE to the database owner
// to allow clients create trusted extensions.
pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_grants(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
info!("cluster spec grants:");
for db in &spec.cluster.databases {

View File

@@ -1,109 +0,0 @@
use std::process::{Command, Stdio};
use anyhow::Result;
use chrono::{DateTime, Utc};
use postgres::{Client, NoTls};
use serde::Deserialize;
use crate::pg_helpers::*;
/// Compute node state shared across several `zenith_ctl` threads.
/// Should be used under `RwLock` to allow HTTP API server to serve
/// status requests, while configuration is in progress.
pub struct ComputeState {
pub connstr: String,
pub pgdata: String,
pub pgbin: String,
pub spec: ClusterSpec,
/// Compute setup process has finished
pub ready: bool,
/// Timestamp of the last Postgres activity
pub last_active: DateTime<Utc>,
}
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[derive(Clone, Deserialize)]
pub struct ClusterSpec {
pub format_version: f32,
pub timestamp: String,
pub operation_uuid: Option<String>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
}
/// Cluster state seen from the perspective of the external tools
/// like Rails web console.
#[derive(Clone, Deserialize)]
pub struct Cluster {
pub cluster_id: String,
pub name: String,
pub state: Option<String>,
pub roles: Vec<Role>,
pub databases: Vec<Database>,
pub settings: GenericOptions,
}
/// Single cluster state changing operation that could not be represented as
/// a static `Cluster` structure. For example:
/// - DROP DATABASE
/// - DROP ROLE
/// - ALTER ROLE name RENAME TO new_name
/// - ALTER DATABASE name RENAME TO new_name
#[derive(Clone, Deserialize)]
pub struct DeltaOp {
pub action: String,
pub name: PgIdent,
pub new_name: Option<PgIdent>,
}
/// Get basebackup from the libpq connection to pageserver using `connstr` and
/// unarchive it to `pgdata` directory overriding all its previous content.
pub fn get_basebackup(
pgdata: &str,
connstr: &str,
tenant: &str,
timeline: &str,
lsn: &str,
) -> Result<()> {
let mut client = Client::connect(connstr, NoTls)?;
let basebackup_cmd = match lsn {
"0/0" => format!("basebackup {} {}", tenant, timeline), // First start of the compute
_ => format!("basebackup {} {} {}", tenant, timeline, lsn),
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut ar = tar::Archive::new(copyreader);
ar.unpack(&pgdata)?;
Ok(())
}
/// Run `postgres` in a special mode with `--sync-safekeepers` argument
/// and return the reported LSN back to the caller.
pub fn sync_safekeepers(pgdata: &str, pgbin: &str) -> Result<String> {
let sync_handle = Command::new(&pgbin)
.args(&["--sync-safekeepers"])
.env("PGDATA", &pgdata) // we cannot use -D in this mode
.stdout(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"postgres --sync-safekeepers exited with non-zero status: {}",
sync_output.status,
);
}
let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim());
Ok(lsn)
}

View File

@@ -4,12 +4,12 @@ mod pg_helpers_tests {
use std::fs::File;
use compute_tools::pg_helpers::*;
use compute_tools::zenith::ClusterSpec;
use compute_tools::spec::ComputeSpec;
#[test]
fn params_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.databases.first().unwrap().to_pg_options(),
@@ -24,7 +24,7 @@ mod pg_helpers_tests {
#[test]
fn settings_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.settings.as_pg_settings(),

View File

@@ -9,3 +9,6 @@ auth_type = 'Trust'
id = 1
pg_port = 5454
http_port = 7676
[etcd_broker]
broker_endpoints = ['http://127.0.0.1:2379']

View File

@@ -274,6 +274,8 @@ impl PostgresNode {
conf.append("listen_addresses", &self.address.ip().to_string());
conf.append("port", &self.address.port().to_string());
conf.append("wal_keep_size", "0");
// walproposer panics when basebackup is invalid, it is pointless to restart in this case.
conf.append("restart_after_crash", "off");
// Configure the node to fetch pages from pageserver
let pageserver_connstr = {

93
control_plane/src/etcd.rs Normal file
View File

@@ -0,0 +1,93 @@
use std::{
fs,
path::PathBuf,
process::{Command, Stdio},
};
use anyhow::Context;
use nix::{
sys::signal::{kill, Signal},
unistd::Pid,
};
use crate::{local_env, read_pidfile};
pub fn start_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let etcd_broker = &env.etcd_broker;
println!(
"Starting etcd broker using {}",
etcd_broker.etcd_binary_path.display()
);
let etcd_data_dir = env.base_data_dir.join("etcd");
fs::create_dir_all(&etcd_data_dir).with_context(|| {
format!(
"Failed to create etcd data dir: {}",
etcd_data_dir.display()
)
})?;
let etcd_stdout_file =
fs::File::create(etcd_data_dir.join("etcd.stdout.log")).with_context(|| {
format!(
"Failed to create ectd stout file in directory {}",
etcd_data_dir.display()
)
})?;
let etcd_stderr_file =
fs::File::create(etcd_data_dir.join("etcd.stderr.log")).with_context(|| {
format!(
"Failed to create ectd stderr file in directory {}",
etcd_data_dir.display()
)
})?;
let client_urls = etcd_broker.comma_separated_endpoints();
let etcd_process = Command::new(&etcd_broker.etcd_binary_path)
.args(&[
format!("--data-dir={}", etcd_data_dir.display()),
format!("--listen-client-urls={client_urls}"),
format!("--advertise-client-urls={client_urls}"),
])
.stdout(Stdio::from(etcd_stdout_file))
.stderr(Stdio::from(etcd_stderr_file))
.spawn()
.context("Failed to spawn etcd subprocess")?;
let pid = etcd_process.id();
let etcd_pid_file_path = etcd_pid_file_path(env);
fs::write(&etcd_pid_file_path, pid.to_string()).with_context(|| {
format!(
"Failed to create etcd pid file at {}",
etcd_pid_file_path.display()
)
})?;
Ok(())
}
pub fn stop_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let etcd_path = &env.etcd_broker.etcd_binary_path;
println!("Stopping etcd broker at {}", etcd_path.display());
let etcd_pid_file_path = etcd_pid_file_path(env);
let pid = Pid::from_raw(read_pidfile(&etcd_pid_file_path).with_context(|| {
format!(
"Failed to read etcd pid filea at {}",
etcd_pid_file_path.display()
)
})?);
kill(pid, Signal::SIGTERM).with_context(|| {
format!(
"Failed to stop etcd with pid {pid} at {}",
etcd_pid_file_path.display()
)
})?;
Ok(())
}
fn etcd_pid_file_path(env: &local_env::LocalEnv) -> PathBuf {
env.base_data_dir.join("etcd.pid")
}

View File

@@ -12,6 +12,7 @@ use std::path::Path;
use std::process::Command;
pub mod compute;
pub mod etcd;
pub mod local_env;
pub mod postgresql_conf;
pub mod safekeeper;

View File

@@ -4,6 +4,7 @@
//! script which will use local paths.
use anyhow::{bail, ensure, Context};
use reqwest::Url;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
@@ -59,9 +60,7 @@ pub struct LocalEnv {
#[serde(default)]
pub private_key_path: PathBuf,
// A comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
pub broker_endpoints: Option<String>,
pub etcd_broker: EtcdBroker,
pub pageserver: PageServerConf,
@@ -77,6 +76,62 @@ pub struct LocalEnv {
branch_name_mappings: HashMap<String, Vec<(ZTenantId, ZTimelineId)>>,
}
/// Etcd broker config for cluster internal communication.
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct EtcdBroker {
/// A prefix to all to any key when pushing/polling etcd from a node.
#[serde(default)]
pub broker_etcd_prefix: Option<String>,
/// Broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
#[serde_as(as = "Vec<DisplayFromStr>")]
pub broker_endpoints: Vec<Url>,
/// Etcd binary path to use.
#[serde(default)]
pub etcd_binary_path: PathBuf,
}
impl EtcdBroker {
pub fn locate_etcd() -> anyhow::Result<PathBuf> {
let which_output = Command::new("which")
.arg("etcd")
.output()
.context("Failed to run 'which etcd' command")?;
let stdout = String::from_utf8_lossy(&which_output.stdout);
ensure!(
which_output.status.success(),
"'which etcd' invocation failed. Status: {}, stdout: {stdout}, stderr: {}",
which_output.status,
String::from_utf8_lossy(&which_output.stderr)
);
let etcd_path = PathBuf::from(stdout.trim());
ensure!(
etcd_path.is_file(),
"'which etcd' invocation was successful, but the path it returned is not a file or does not exist: {}",
etcd_path.display()
);
Ok(etcd_path)
}
pub fn comma_separated_endpoints(&self) -> String {
self.broker_endpoints.iter().map(Url::as_str).fold(
String::new(),
|mut comma_separated_urls, url| {
if !comma_separated_urls.is_empty() {
comma_separated_urls.push(',');
}
comma_separated_urls.push_str(url);
comma_separated_urls
},
)
}
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
#[serde(default)]
pub struct PageServerConf {
@@ -180,12 +235,7 @@ impl LocalEnv {
if old_timeline_id == &timeline_id {
Ok(())
} else {
bail!(
"branch '{}' is already mapped to timeline {}, cannot map to another timeline {}",
branch_name,
old_timeline_id,
timeline_id
);
bail!("branch '{branch_name}' is already mapped to timeline {old_timeline_id}, cannot map to another timeline {timeline_id}");
}
} else {
existing_values.push((tenant_id, timeline_id));
@@ -221,7 +271,7 @@ impl LocalEnv {
///
/// Unlike 'load_config', this function fills in any defaults that are missing
/// from the config file.
pub fn create_config(toml: &str) -> anyhow::Result<Self> {
pub fn parse_config(toml: &str) -> anyhow::Result<Self> {
let mut env: LocalEnv = toml::from_str(toml)?;
// Find postgres binaries.
@@ -234,26 +284,11 @@ impl LocalEnv {
env.pg_distrib_dir = cwd.join("tmp_install")
}
}
if !env.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
env.pg_distrib_dir.display()
);
}
// Find zenith binaries.
if env.zenith_distrib_dir == Path::new("") {
env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
for binary in ["pageserver", "safekeeper"] {
if !env.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{}' in zenith distrib dir '{}'",
binary,
env.zenith_distrib_dir.display()
);
}
}
// If no initial tenant ID was given, generate it.
if env.default_tenant_id.is_none() {
@@ -347,6 +382,36 @@ impl LocalEnv {
"directory '{}' already exists. Perhaps already initialized?",
base_path.display()
);
if !self.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
self.pg_distrib_dir.display()
);
}
for binary in ["pageserver", "safekeeper"] {
if !self.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{}' in zenith distrib dir '{}'",
binary,
self.zenith_distrib_dir.display()
);
}
}
for binary in ["pageserver", "safekeeper"] {
if !self.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{binary}' in zenith distrib dir '{}'",
self.zenith_distrib_dir.display()
);
}
}
if !self.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
self.pg_distrib_dir.display()
);
}
fs::create_dir(&base_path)?;
@@ -404,7 +469,35 @@ impl LocalEnv {
fn base_path() -> PathBuf {
match std::env::var_os("ZENITH_REPO_DIR") {
Some(val) => PathBuf::from(val.to_str().unwrap()),
None => ".zenith".into(),
Some(val) => PathBuf::from(val),
None => PathBuf::from(".zenith"),
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn simple_conf_parsing() {
let simple_conf_toml = include_str!("../simple.conf");
let simple_conf_parse_result = LocalEnv::parse_config(simple_conf_toml);
assert!(
simple_conf_parse_result.is_ok(),
"failed to parse simple config {simple_conf_toml}, reason: {simple_conf_parse_result:?}"
);
let string_to_replace = "broker_endpoints = ['http://127.0.0.1:2379']";
let spoiled_url_str = "broker_endpoints = ['!@$XOXO%^&']";
let spoiled_url_toml = simple_conf_toml.replace(string_to_replace, spoiled_url_str);
assert!(
spoiled_url_toml.contains(spoiled_url_str),
"Failed to replace string {string_to_replace} in the toml file {simple_conf_toml}"
);
let spoiled_url_parse_result = LocalEnv::parse_config(&spoiled_url_toml);
assert!(
spoiled_url_parse_result.is_err(),
"expected toml with invalid Url {spoiled_url_toml} to fail the parsing, but got {spoiled_url_parse_result:?}"
);
}
}

View File

@@ -52,7 +52,7 @@ impl ResponseErrorMessageExt for Response {
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
Err(_) => format!("Http error ({}) at {url}.", status.as_u16()),
},
))
}
@@ -75,16 +75,12 @@ pub struct SafekeeperNode {
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
broker_endpoints: Option<String>,
}
impl SafekeeperNode {
pub fn from_env(env: &LocalEnv, conf: &SafekeeperConf) -> SafekeeperNode {
let pageserver = Arc::new(PageServerNode::from_env(env));
println!("initializing for sk {} for {}", conf.id, conf.http_port);
SafekeeperNode {
id: conf.id,
conf: conf.clone(),
@@ -93,7 +89,6 @@ impl SafekeeperNode {
http_client: Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
pageserver,
broker_endpoints: env.broker_endpoints.clone(),
}
}
@@ -140,8 +135,13 @@ impl SafekeeperNode {
if !self.conf.sync {
cmd.arg("--no-sync");
}
if let Some(ref ep) = self.broker_endpoints {
cmd.args(&["--broker-endpoints", ep]);
let comma_separated_endpoints = self.env.etcd_broker.comma_separated_endpoints();
if !comma_separated_endpoints.is_empty() {
cmd.args(&["--broker-endpoints", &comma_separated_endpoints]);
}
if let Some(prefix) = self.env.etcd_broker.broker_etcd_prefix.as_deref() {
cmd.args(&["--broker-etcd-prefix", prefix]);
}
if !cmd.status()?.success() {
@@ -205,12 +205,13 @@ impl SafekeeperNode {
let pid = Pid::from_raw(pid);
let sig = if immediate {
println!("Stop safekeeper immediately");
print!("Stopping safekeeper {} immediately..", self.id);
Signal::SIGQUIT
} else {
println!("Stop safekeeper gracefully");
print!("Stopping safekeeper {} gracefully..", self.id);
Signal::SIGTERM
};
io::stdout().flush().unwrap();
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
@@ -232,25 +233,35 @@ impl SafekeeperNode {
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if safekeeper flushes a lot of data
let mut tcp_stopped = false;
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Safekeeper stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Safekeeper status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Safekeeper status is: {}", err);
return Ok(());
if !tcp_stopped {
if let Err(err) = TcpStream::connect(&address) {
tcp_stopped = true;
if err.kind() != io::ErrorKind::ConnectionRefused {
eprintln!("\nSafekeeper connection failed with error: {err}");
}
}
} else {
println!("Safekeeper still receives connections");
thread::sleep(Duration::from_secs(1));
}
if tcp_stopped {
// Also check status on the HTTP port
match self.check_status() {
Err(SafekeeperHttpError::Transport(err)) if err.is_connect() => {
println!("done!");
return Ok(());
}
Err(err) => {
eprintln!("\nSafekeeper status check failed with error: {err}");
return Ok(());
}
Ok(()) => {
// keep waiting
}
}
}
print!(".");
io::stdout().flush().unwrap();
thread::sleep(Duration::from_secs(1));
}
bail!("Failed to stop safekeeper with pid {}", pid);

View File

@@ -1,5 +1,6 @@
use std::collections::HashMap;
use std::io::Write;
use std::fs::File;
use std::io::{BufReader, Write};
use std::net::TcpStream;
use std::path::PathBuf;
use std::process::Command;
@@ -121,6 +122,16 @@ impl PageServerNode {
);
let listen_pg_addr_param =
format!("listen_pg_addr='{}'", self.env.pageserver.listen_pg_addr);
let broker_endpoints_param = format!(
"broker_endpoints=[{}]",
self.env
.etcd_broker
.broker_endpoints
.iter()
.map(|url| format!("'{url}'"))
.collect::<Vec<_>>()
.join(",")
);
let mut args = Vec::with_capacity(20);
args.push("--init");
@@ -129,8 +140,19 @@ impl PageServerNode {
args.extend(["-c", &authg_type_param]);
args.extend(["-c", &listen_http_addr_param]);
args.extend(["-c", &listen_pg_addr_param]);
args.extend(["-c", &broker_endpoints_param]);
args.extend(["-c", &id]);
let broker_etcd_prefix_param = self
.env
.etcd_broker
.broker_etcd_prefix
.as_ref()
.map(|prefix| format!("broker_etcd_prefix='{prefix}'"));
if let Some(broker_etcd_prefix_param) = broker_etcd_prefix_param.as_deref() {
args.extend(["-c", broker_etcd_prefix_param]);
}
for config_override in config_overrides {
args.extend(["-c", config_override]);
}
@@ -167,6 +189,9 @@ impl PageServerNode {
);
}
// echo the captured output of the init command
println!("{}", String::from_utf8_lossy(&init_output.stdout));
Ok(initial_timeline_id)
}
@@ -186,8 +211,6 @@ impl PageServerNode {
);
io::stdout().flush().unwrap();
let mut cmd = Command::new(self.env.pageserver_bin()?);
let repo_path = self.repo_path();
let mut args = vec!["-D", repo_path.to_str().unwrap()];
@@ -195,9 +218,11 @@ impl PageServerNode {
args.extend(["-c", config_override]);
}
fill_rust_env_vars(cmd.args(&args).arg("--daemonize"));
let mut cmd = Command::new(self.env.pageserver_bin()?);
let mut filled_cmd = fill_rust_env_vars(cmd.args(&args).arg("--daemonize"));
filled_cmd = fill_aws_secrets_vars(filled_cmd);
if !cmd.status()?.success() {
if !filled_cmd.status()?.success() {
bail!(
"Pageserver failed to start. See '{}' for details.",
self.repo_path().join("pageserver.log").display()
@@ -257,12 +282,13 @@ impl PageServerNode {
let pid = Pid::from_raw(read_pidfile(&pid_file)?);
let sig = if immediate {
println!("Stop pageserver immediately");
print!("Stopping pageserver immediately..");
Signal::SIGQUIT
} else {
println!("Stop pageserver gracefully");
print!("Stopping pageserver gracefully..");
Signal::SIGTERM
};
io::stdout().flush().unwrap();
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
@@ -284,25 +310,36 @@ impl PageServerNode {
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if pageserver checkpoints a lot of data
let mut tcp_stopped = false;
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Pageserver stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Pageserver status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Pageserver status is: {}", err);
return Ok(());
if !tcp_stopped {
if let Err(err) = TcpStream::connect(&address) {
tcp_stopped = true;
if err.kind() != io::ErrorKind::ConnectionRefused {
eprintln!("\nPageserver connection failed with error: {err}");
}
}
} else {
println!("Pageserver still receives connections");
thread::sleep(Duration::from_secs(1));
}
if tcp_stopped {
// Also check status on the HTTP port
match self.check_status() {
Err(PageserverHttpError::Transport(err)) if err.is_connect() => {
println!("done!");
return Ok(());
}
Err(err) => {
eprintln!("\nPageserver status check failed with error: {err}");
return Ok(());
}
Ok(()) => {
// keep waiting
}
}
}
print!(".");
io::stdout().flush().unwrap();
thread::sleep(Duration::from_secs(1));
}
bail!("Failed to stop pageserver with pid {}", pid);
@@ -369,6 +406,10 @@ impl PageServerNode {
.map(|x| x.parse::<u64>())
.transpose()?,
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
@@ -405,6 +446,9 @@ impl PageServerNode {
.get("gc_horizon")
.map(|x| x.parse::<u64>().unwrap()),
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>().unwrap()),
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
@@ -449,4 +493,63 @@ impl PageServerNode {
Ok(timeline_info_response)
}
/// Import a basebackup prepared using either:
/// a) `pg_basebackup -F tar`, or
/// b) The `fullbackup` pageserver endpoint
///
/// # Arguments
/// * `tenant_id` - tenant to import into. Created if not exists
/// * `timeline_id` - id to assign to imported timeline
/// * `base` - (start lsn of basebackup, path to `base.tar` file)
/// * `pg_wal` - if there's any wal to import: (end lsn, path to `pg_wal.tar`)
pub fn timeline_import(
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
base: (Lsn, PathBuf),
pg_wal: Option<(Lsn, PathBuf)>,
) -> anyhow::Result<()> {
let mut client = self.pg_connection_config.connect(NoTls).unwrap();
// Init base reader
let (start_lsn, base_tarfile_path) = base;
let base_tarfile = File::open(base_tarfile_path)?;
let mut base_reader = BufReader::new(base_tarfile);
// Init wal reader if necessary
let (end_lsn, wal_reader) = if let Some((end_lsn, wal_tarfile_path)) = pg_wal {
let wal_tarfile = File::open(wal_tarfile_path)?;
let wal_reader = BufReader::new(wal_tarfile);
(end_lsn, Some(wal_reader))
} else {
(start_lsn, None)
};
// Import base
let import_cmd =
format!("import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn}");
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut base_reader, &mut writer)?;
writer.finish()?;
// Import wal if necessary
if let Some(mut wal_reader) = wal_reader {
let import_cmd = format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}");
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut wal_reader, &mut writer)?;
writer.finish()?;
}
Ok(())
}
}
fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command {
for env_key in ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"] {
if let Ok(value) = std::env::var(env_key) {
cmd = cmd.env(env_key, value);
}
}
cmd
}

View File

@@ -1,13 +1,20 @@
#!/bin/sh
set -eux
broker_endpoints_param="${BROKER_ENDPOINT:-absent}"
if [ "$broker_endpoints_param" != "absent" ]; then
broker_endpoints_param="-c broker_endpoints=['$broker_endpoints_param']"
else
broker_endpoints_param=''
fi
if [ "$1" = 'pageserver' ]; then
if [ ! -d "/data/tenants" ]; then
echo "Initializing pageserver data directory"
pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10"
pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" $broker_endpoints_param
fi
echo "Staring pageserver at 0.0.0.0:6400"
pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data
pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" $broker_endpoints_param -D /data
else
"$@"
fi

View File

@@ -7,8 +7,8 @@
- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
- [sourcetree.md](sourcetree.md) — Overview of the source tree layeout.
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/libs/postgres_ffi/README) — Postgres FFI overview.
- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [safekeeper/README](/safekeeper/README) — WAL service overview.
- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core

View File

@@ -1,20 +1,20 @@
# Docker images of Zenith
# Docker images of Neon
## Images
Currently we build two main images:
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
- [neondatabase/neon](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [neondatabase/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres).
And additional intermediate images:
And additional intermediate image:
- [zenithdb/compute-tools](https://hub.docker.com/repository/docker/zenithdb/compute-tools) — compute node configuration management tools.
- [neondatabase/compute-tools](https://hub.docker.com/repository/docker/neondatabase/compute-tools) — compute node configuration management tools.
## Building pipeline
1. Image `zenithdb/compute-tools` is re-built automatically.
We build all images after a successful `release` tests run and push automatically to Docker Hub with two parallel CI jobs
2. Image `zenithdb/compute-node` is built independently in the [zenithdb/postgres](https://github.com/zenithdb/postgres) repo.
1. `neondatabase/compute-tools` and `neondatabase/compute-node`
3. Image `zenithdb/zenith` is built in this repo after a successful `release` tests run and pushed to Docker Hub automatically.
2. `neondatabase/neon`

View File

@@ -21,7 +21,7 @@ NOTE:It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create branch at certain LSN using `zenith timeline branch` command.
We can create branch at certain LSN using `neon_local timeline branch` command.
Each Branch lives in a corresponding timeline[] and has an ancestor[].
@@ -91,7 +91,7 @@ The layer map tracks what layers exist in a timeline.
### Layered repository
Zenith repository implementation that keeps data in layers.
Neon repository implementation that keeps data in layers.
### LSN
The Log Sequence Number (LSN) is a unique identifier of the WAL record[] in the WAL log.
@@ -101,7 +101,7 @@ It is printed as two hexadecimal numbers of up to 8 digits each, separated by a
Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html)
Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
In postgres and Zenith lsns are used to describe certain points in WAL handling.
In Postgres and Neon LSNs are used to describe certain points in WAL handling.
PostgreSQL LSNs and functions to monitor them:
* `pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location.
@@ -111,13 +111,13 @@ PostgreSQL LSNs and functions to monitor them:
* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
[source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html):
Zenith safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md)
Neon safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md)
* `CommitLSN`: position in WAL confirmed by quorum safekeepers.
* `RestartLSN`: position in WAL confirmed by all safekeepers.
* `FlushLSN`: part of WAL persisted to the disk by safekeeper.
* `VCL`: the largerst LSN for which we can guarantee availablity of all prior records.
Zenith pageserver LSNs:
Neon pageserver LSNs:
* `last_record_lsn` - the end of last processed WAL record.
* `disk_consistent_lsn` - data is known to be fully flushed and fsync'd to local disk on pageserver up to this LSN.
* `remote_consistent_lsn` - The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash.
@@ -132,7 +132,7 @@ This is the unit of data exchange between compute node and pageserver.
### Pageserver
Zenith storage engine: repositories + wal receiver + page service + wal redo.
Neon storage engine: repositories + wal receiver + page service + wal redo.
### Page service
@@ -184,10 +184,10 @@ relation exceeds that size, it is split into multiple segments.
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in zenith yet (pg_commit_ts).
or we do not support them in neon yet (pg_commit_ts).
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Zenith.
Tenant represents a single customer, interacting with Neon.
Wal redo[] activity, timelines[], layers[] are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper

View File

@@ -22,7 +22,7 @@ In addition to the WAL safekeeper nodes, the WAL is archived in
S3. WAL that has been archived to S3 can be removed from the
safekeepers, so the safekeepers don't need a lot of disk space.
```
+----------------+
+-----> | WAL safekeeper |
| +----------------+
@@ -42,23 +42,23 @@ safekeepers, so the safekeepers don't need a lot of disk space.
\
\
\
\ +--------+
\ | |
+--> | S3 |
| |
+--------+
\ +--------+
\ | |
+------> | S3 |
| |
+--------+
```
Every WAL safekeeper holds a section of WAL, and a VCL value.
The WAL can be divided into three portions:
```
VCL LSN
| |
V V
.................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX
Archived WAL Completed WAL In-flight WAL
```
Note that all this WAL kept in a safekeeper is a contiguous section.
This is different from Aurora: In Aurora, there can be holes in the

View File

@@ -0,0 +1,151 @@
# Dispatching a connection
For each client connection, Neon service needs to authenticate the
connection, and route it to the right PostgreSQL instance.
## Authentication
There are three different ways to authenticate:
- anonymous; no authentication needed
- PostgreSQL authentication
- github single sign-on using browser
In anonymous access, the user doesn't need to perform any
authentication at all. This can be used e.g. in interactive PostgreSQL
documentation, allowing you to run the examples very quickly. Similar
to sqlfiddle.com.
PostgreSQL authentication works the same as always. All the different
PostgreSQL authentication options like SCRAM, kerberos, etc. are
available. [1]
The third option is to authenticate with github single sign-on. When
you open the connection in psql, you get a link that you open with
your browser. Opening the link redirects you to github authentication,
and lets the connection to proceed. This is also known as "Link auth" [2].
## Routing the connection
When a client starts a connection, it needs to be routed to the
correct PostgreSQL instance. Routing can be done by the proxy, acting
as a man-in-the-middle, or the connection can be routed at the network
level based on the hostname or IP address.
Either way, Neon needs to identify which PostgreSQL instance the
connection should be routed to. If the instance is not already
running, it needs to be started. Some connections always require a new
PostgreSQL instance to be created, e.g. if you want to run a one-off
query against a particular point-in-time.
The PostgreSQL instance is identified by:
- Neon account (possibly anonymous)
- cluster (known as tenant in the storage?)
- branch or snapshot name
- timestamp (PITR)
- primary or read-replica
- one-off read replica
- one-off writeable branch
When you are using regular PostgreSQL authentication or anonymous
access, the connection URL needs to contain all the information needed
for the routing. With github single sign-on, the browser is involved
and some details - the Neon account in particular - can be deduced
from the authentication exchange.
There are three methods for identifying the PostgreSQL instance:
- Browser interaction (link auth)
- Options in the connection URL and the domain name
- A pre-defined endpoint, identified by domain name or IP address
### Link Auth
postgres://<username>@start.neon.tech/<dbname>
This gives you a link that you open in browser. Clicking the link
performs github authentication, and the Neon account name is
provided to the proxy behind the scenes. The proxy routes the
connection to the primary PostgreSQL instance in cluster called
"main", branch "main".
Further ideas:
- You could pre-define a different target for link auth
connections in the UI.
- You could have a drop-down in the browser, allowing you to connect
to any cluster you want. Link Auth can be like Teleport.
### Connection URL
The connection URL looks like this:
postgres://<username>@<cluster-id>.db.neon.tech/<dbname>
By default, this connects you to the primary PostgreSQL instance
running on the "main" branch in the named cluster [3]. However, you can
change that by specifying options in the connection URL. The following
options are supported:
| option name | Description | Examples |
| --- | --- | --- |
| cluster | Cluster name | cluster:myproject |
| branch | Branch name | branch:main |
| timestamp | Connect to an instance at given point-in-time. | timestamp:2022-04-08 timestamp:2022-04-08T11:42:16Z |
| lsn | Connect to an instance at given LSN | lsn:0/12FF0420 |
| read-replica | Connect to a read-replica. If the parameter is 'new', a new instance is created for this session. | read-replica read-replica:new |
For example, to read branch 'testing' as it was on Mar 31, 2022, you could
specify a timestamp in the connection URL [4]:
postgres://alice@cluster-1234.db.neon.tech/postgres?options=branch:testing,timestamp:2022-03-31
Connecting with cluster name and options can be disabled in the UI. If
disabled, you can only connect using a pre-defined endpoint.
### Pre-defined Endpoint
Instead of providing the cluster name, branch, and all those options
in the connection URL, you can define a named endpoint with the same
options.
In the UI, click "create endpoint". Fill in the details:
- Cluster name
- Branch
- timestamp or LSN
- is this for the primary or for a read replica
- etc.
When you click Finish, a named endpoint is created. You can now use the endpoint ID to connect:
postgres://<username>@<endpoint-id>.endpoint.neon.tech/<dbname>
An endpoint can be assigned a static or dynamic IP address, so that
you can connect to it with clients that don't support TLS SNI. Maybe
bypass the proxy altogether, but that ought to be invisible to the
user.
You can limit the range of source IP addresses that are allowed to
connect to an endpoint. An endpoint can also be exposed in an Amazon
VPC, allowing direct connections from applications.
# Footnotes
[1] I'm not sure how feasible it is to set up configure like Kerberos
or LDAP in a cloud environment. But in principle I think we should
allow customers to have the full power of PostgreSQL, including all
authentication options. However, it's up to the customer to configure
it correctly.
[2] Link is a way to both authenticate and to route the connection
[3] This assumes that cluster-ids are globally unique, across all
Neon accounts.
[4] The syntax accepted in the connection URL is limited by libpq. The
only way to pass arbitrary options to the server (or our proxy) is
with the "options" keyword, and the options must be percent-encoded. I
think the above would work but i haven't tested it

View File

@@ -6,7 +6,6 @@ If there's no such file during `init` phase of the server, it creates the file i
There's a possibility to pass an arbitrary config value to the pageserver binary as an argument: such values override
the values in the config file, if any are specified for the same key and get into the final config during init phase.
### Config example
```toml
@@ -26,18 +25,22 @@ max_file_descriptors = '100'
# initial superuser role name to use when creating a new tenant
initial_superuser_name = 'zenith_admin'
broker_etcd_prefix = 'neon'
broker_endpoints = ['some://etcd']
# [remote_storage]
```
The config above shows default values for all basic pageserver settings.
The config above shows default values for all basic pageserver settings, besides `broker_endpoints`: that one has to be set by the user,
see the corresponding section below.
Pageserver uses default values for all files that are missing in the config, so it's not a hard error to leave the config blank.
Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start.
Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in TOML specification and
* either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
- either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
* or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
- or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
### Config values
@@ -47,6 +50,17 @@ Example: `${PAGESERVER_BIN} -c "checkpoint_period = '100 s'" -c "remote_storage=
Note that TOML distinguishes between strings and integers, the former require single or double quotes around them.
#### broker_endpoints
A list of endpoints (etcd currently) to connect and pull the information from.
Mandatory, does not have a default, since requires etcd to be started as a separate process,
and its connection url should be specified separately.
#### broker_etcd_prefix
A prefix to add for every etcd key used, to separate one group of related instances from another, in the same cluster.
Default is `neon`.
#### checkpoint_distance
`checkpoint_distance` is the amount of incoming WAL that is held in
@@ -57,7 +71,7 @@ but it will trigger a checkpoint operation to get it back below the
limit.
`checkpoint_distance` also determines how much WAL needs to be kept
durable in the safekeeper. The safekeeper must have capacity to hold
durable in the safekeeper. The safekeeper must have capacity to hold
this much WAL, with some headroom, otherwise you can get stuck in a
situation where the safekeeper is full and stops accepting new WAL,
but the pageserver is not flushing out and releasing the space in the
@@ -72,7 +86,11 @@ The unit is # of bytes.
Every `compaction_period` seconds, the page server checks if
maintenance operations, like compaction, are needed on the layer
files. Default is 1 s, which should be fine.
files. Default is 1 s, which should be fine.
#### compaction_target_size
File sizes for L0 delta and L1 image layers. Default is 128MB.
#### gc_horizon
@@ -85,6 +103,14 @@ away.
Interval at which garbage collection is triggered. Default is 100 s.
#### image_creation_threshold
L0 delta layer threshold for L1 iamge layer creation. Default is 3.
#### pitr_interval
WAL retention duration for PITR branching. Default is 30 days.
#### initial_superuser_name
Name of the initial superuser role, passed to initdb when a new tenant
@@ -151,16 +177,12 @@ bucket_region = 'eu-north-1'
# Optional, pageserver uses entire bucket if the prefix is not specified.
prefix_in_bucket = '/some/prefix/'
# Access key to connect to the bucket ("login" part of the credentials)
access_key_id = 'SOMEKEYAAAAASADSAH*#'
# Secret access key to connect to the bucket ("password" part of the credentials)
secret_access_key = 'SOMEsEcReTsd292v'
# S3 API query limit to avoid getting errors/throttling from AWS.
concurrency_limit = 100
```
If no IAM bucket access is used during the remote storage usage, use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables to set the access credentials.
###### General remote storage configuration
Pagesever allows only one remote storage configured concurrently and errors if parameters from multiple different remote configurations are used.
@@ -171,13 +193,12 @@ Besides, there are parameters common for all types of remote storage that can be
```toml
[remote_storage]
# Max number of concurrent timeline synchronized (layers uploaded or downloaded) with the remote storage at the same time.
max_concurrent_timelines_sync = 50
max_concurrent_syncs = 50
# Max number of errors a single task can have before it's considered failed and not attempted to run anymore.
max_sync_errors = 10
```
## safekeeper
TODO

View File

@@ -28,7 +28,7 @@ The pageserver has a few different duties:
- Receive WAL from the WAL service and decode it.
- Replay WAL that's applicable to the chunks that the Page Server maintains
For more detailed info, see `/pageserver/README`
For more detailed info, see [/pageserver/README](/pageserver/README.md)
`/proxy`:
@@ -57,7 +57,7 @@ PostgreSQL extension that contains functions needed for testing and debugging.
The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see `/safekeeper/README`
For more detailed info, see [/safekeeper/README](/safekeeper/README.md)
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
@@ -91,18 +91,22 @@ so manual installation of dependencies is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version) or greater.
- Install Python 3.9 (the minimal supported version) or greater.
- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected.
- If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.:
- If you have some trouble with other version you can resolve it by installing Python 3.9 separately, via [pyenv](https://github.com/pyenv/pyenv) or via system package manager e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
sudo apt install python3.9
```
- Install `poetry`
- Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`.
- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI.
- Install dependencies via `./scripts/pysync`.
- Note that CI uses specific Python version (look for `PYTHON_VERSION` [here](https://github.com/neondatabase/docker-images/blob/main/rust/Dockerfile))
so if you have different version some linting tools can yield different result locally vs in the CI.
- You can explicitly specify which Python to use by running `poetry env use /path/to/python`, e.g. `poetry env use python3.9`.
This may also disable the `The currently activated Python version X.Y.Z is not supported by the project` warning.
Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.

View File

@@ -0,0 +1,17 @@
[package]
name = "etcd_broker"
version = "0.1.0"
edition = "2021"
[dependencies]
etcd-client = "0.9.0"
regex = "1.4.5"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "1.12.0"
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
tokio = "1"
tracing = "0.1"
thiserror = "1"

348
libs/etcd_broker/src/lib.rs Normal file
View File

@@ -0,0 +1,348 @@
//! A set of primitives to access a shared data/updates, propagated via etcd broker (not persistent).
//! Intended to connect services to each other, not to store their data.
use std::{
collections::{hash_map, HashMap},
fmt::Display,
str::FromStr,
};
use regex::{Captures, Regex};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
pub use etcd_client::*;
use tokio::{sync::mpsc, task::JoinHandle};
use tracing::*;
use utils::{
lsn::Lsn,
zid::{ZNodeId, ZTenantId, ZTenantTimelineId},
};
/// Default value to use for prefixing to all etcd keys with.
/// This way allows isolating safekeeper/pageserver groups in the same etcd cluster.
pub const DEFAULT_NEON_BROKER_ETCD_PREFIX: &str = "neon";
#[derive(Debug, Deserialize, Serialize)]
struct SafekeeperTimeline {
safekeeper_id: ZNodeId,
info: SkTimelineInfo,
}
/// Published data about safekeeper's timeline. Fields made optional for easy migrations.
#[serde_as]
#[derive(Debug, Deserialize, Serialize)]
pub struct SkTimelineInfo {
/// Term of the last entry.
pub last_log_term: Option<u64>,
/// LSN of the last record.
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(default)]
pub flush_lsn: Option<Lsn>,
/// Up to which LSN safekeeper regards its WAL as committed.
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(default)]
pub commit_lsn: Option<Lsn>,
/// LSN up to which safekeeper offloaded WAL to s3.
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(default)]
pub s3_wal_lsn: Option<Lsn>,
/// LSN of last checkpoint uploaded by pageserver.
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(default)]
pub remote_consistent_lsn: Option<Lsn>,
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(default)]
pub peer_horizon_lsn: Option<Lsn>,
#[serde(default)]
pub safekeeper_connection_string: Option<String>,
}
#[derive(Debug, thiserror::Error)]
pub enum BrokerError {
#[error("Etcd client error: {0}. Context: {1}")]
EtcdClient(etcd_client::Error, String),
#[error("Error during parsing etcd data: {0}")]
ParsingError(String),
#[error("Internal error: {0}")]
InternalError(String),
}
/// A way to control the data retrieval from a certain subscription.
pub struct SkTimelineSubscription {
safekeeper_timeline_updates:
mpsc::UnboundedReceiver<HashMap<ZTenantTimelineId, HashMap<ZNodeId, SkTimelineInfo>>>,
kind: SkTimelineSubscriptionKind,
watcher_handle: JoinHandle<Result<(), BrokerError>>,
watcher: Watcher,
}
impl SkTimelineSubscription {
/// Asynchronously polls for more data from the subscription, suspending the current future if there's no data sent yet.
pub async fn fetch_data(
&mut self,
) -> Option<HashMap<ZTenantTimelineId, HashMap<ZNodeId, SkTimelineInfo>>> {
self.safekeeper_timeline_updates.recv().await
}
/// Cancels the subscription, stopping the data poller and waiting for it to shut down.
pub async fn cancel(mut self) -> Result<(), BrokerError> {
self.watcher.cancel().await.map_err(|e| {
BrokerError::EtcdClient(
e,
format!(
"Failed to cancel timeline subscription, kind: {:?}",
self.kind
),
)
})?;
self.watcher_handle.await.map_err(|e| {
BrokerError::InternalError(format!(
"Failed to join the timeline updates task, kind: {:?}, error: {e}",
self.kind
))
})?
}
}
/// The subscription kind to the timeline updates from safekeeper.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SkTimelineSubscriptionKind {
broker_etcd_prefix: String,
kind: SubscriptionKind,
}
impl SkTimelineSubscriptionKind {
pub fn all(broker_etcd_prefix: String) -> Self {
Self {
broker_etcd_prefix,
kind: SubscriptionKind::All,
}
}
pub fn tenant(broker_etcd_prefix: String, tenant: ZTenantId) -> Self {
Self {
broker_etcd_prefix,
kind: SubscriptionKind::Tenant(tenant),
}
}
pub fn timeline(broker_etcd_prefix: String, timeline: ZTenantTimelineId) -> Self {
Self {
broker_etcd_prefix,
kind: SubscriptionKind::Timeline(timeline),
}
}
fn watch_regex(&self) -> Regex {
match self.kind {
SubscriptionKind::All => Regex::new(&format!(
r"^{}/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
self.broker_etcd_prefix
))
.expect("wrong regex for 'everything' subscription"),
SubscriptionKind::Tenant(tenant_id) => Regex::new(&format!(
r"^{}/{tenant_id}/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
self.broker_etcd_prefix
))
.expect("wrong regex for 'tenant' subscription"),
SubscriptionKind::Timeline(ZTenantTimelineId {
tenant_id,
timeline_id,
}) => Regex::new(&format!(
r"^{}/{tenant_id}/{timeline_id}/safekeeper/([[:digit:]])$",
self.broker_etcd_prefix
))
.expect("wrong regex for 'timeline' subscription"),
}
}
/// Etcd key to use for watching a certain timeline updates from safekeepers.
pub fn watch_key(&self) -> String {
match self.kind {
SubscriptionKind::All => self.broker_etcd_prefix.to_string(),
SubscriptionKind::Tenant(tenant_id) => {
format!("{}/{tenant_id}/safekeeper", self.broker_etcd_prefix)
}
SubscriptionKind::Timeline(ZTenantTimelineId {
tenant_id,
timeline_id,
}) => format!(
"{}/{tenant_id}/{timeline_id}/safekeeper",
self.broker_etcd_prefix
),
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SubscriptionKind {
/// Get every timeline update.
All,
/// Get certain tenant timelines' updates.
Tenant(ZTenantId),
/// Get certain timeline updates.
Timeline(ZTenantTimelineId),
}
/// Creates a background task to poll etcd for timeline updates from safekeepers.
/// Stops and returns `Err` on any error during etcd communication.
/// Watches the key changes until either the watcher is cancelled via etcd or the subscription cancellation handle,
/// exiting normally in such cases.
pub async fn subscribe_to_safekeeper_timeline_updates(
client: &mut Client,
subscription: SkTimelineSubscriptionKind,
) -> Result<SkTimelineSubscription, BrokerError> {
info!("Subscribing to timeline updates, subscription kind: {subscription:?}");
let (watcher, mut stream) = client
.watch(
subscription.watch_key(),
Some(WatchOptions::new().with_prefix()),
)
.await
.map_err(|e| {
BrokerError::EtcdClient(
e,
format!("Failed to init the watch for subscription {subscription:?}"),
)
})?;
let (timeline_updates_sender, safekeeper_timeline_updates) = mpsc::unbounded_channel();
let subscription_kind = subscription.kind;
let regex = subscription.watch_regex();
let watcher_handle = tokio::spawn(async move {
while let Some(resp) = stream.message().await.map_err(|e| BrokerError::InternalError(format!(
"Failed to get messages from the subscription stream, kind: {subscription_kind:?}, error: {e}"
)))? {
if resp.canceled() {
info!("Watch for timeline updates subscription was canceled, exiting");
break;
}
let mut timeline_updates: HashMap<ZTenantTimelineId, HashMap<ZNodeId, SkTimelineInfo>> = HashMap::new();
// Keep track that the timeline data updates from etcd arrive in the right order.
// https://etcd.io/docs/v3.5/learning/api_guarantees/#isolation-level-and-consistency-of-replicas
// > etcd does not ensure linearizability for watch operations. Users are expected to verify the revision of watch responses to ensure correct ordering.
let mut timeline_etcd_versions: HashMap<ZTenantTimelineId, i64> = HashMap::new();
let events = resp.events();
debug!("Processing {} events", events.len());
for event in events {
if EventType::Put == event.event_type() {
if let Some(new_etcd_kv) = event.kv() {
let new_kv_version = new_etcd_kv.version();
match parse_etcd_key_value(subscription_kind, &regex, new_etcd_kv) {
Ok(Some((zttid, timeline))) => {
match timeline_updates
.entry(zttid)
.or_default()
.entry(timeline.safekeeper_id)
{
hash_map::Entry::Occupied(mut o) => {
let old_etcd_kv_version = timeline_etcd_versions.get(&zttid).copied().unwrap_or(i64::MIN);
if old_etcd_kv_version < new_kv_version {
o.insert(timeline.info);
timeline_etcd_versions.insert(zttid,new_kv_version);
}
}
hash_map::Entry::Vacant(v) => {
v.insert(timeline.info);
timeline_etcd_versions.insert(zttid,new_kv_version);
}
}
}
Ok(None) => {}
Err(e) => error!("Failed to parse timeline update: {e}"),
};
}
}
}
if let Err(e) = timeline_updates_sender.send(timeline_updates) {
info!("Timeline updates sender got dropped, exiting: {e}");
break;
}
}
Ok(())
});
Ok(SkTimelineSubscription {
kind: subscription,
safekeeper_timeline_updates,
watcher_handle,
watcher,
})
}
fn parse_etcd_key_value(
subscription_kind: SubscriptionKind,
regex: &Regex,
kv: &KeyValue,
) -> Result<Option<(ZTenantTimelineId, SafekeeperTimeline)>, BrokerError> {
let caps = if let Some(caps) = regex.captures(kv.key_str().map_err(|e| {
BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as key str"))
})?) {
caps
} else {
return Ok(None);
};
let (zttid, safekeeper_id) = match subscription_kind {
SubscriptionKind::All => (
ZTenantTimelineId::new(
parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?,
parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?,
),
ZNodeId(parse_capture(&caps, 3).map_err(BrokerError::ParsingError)?),
),
SubscriptionKind::Tenant(tenant_id) => (
ZTenantTimelineId::new(
tenant_id,
parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?,
),
ZNodeId(parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?),
),
SubscriptionKind::Timeline(zttid) => (
zttid,
ZNodeId(parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?),
),
};
let info_str = kv.value_str().map_err(|e| {
BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as value str"))
})?;
Ok(Some((
zttid,
SafekeeperTimeline {
safekeeper_id,
info: serde_json::from_str(info_str).map_err(|e| {
BrokerError::ParsingError(format!(
"Failed to parse '{info_str}' as safekeeper timeline info: {e}"
))
})?,
},
)))
}
fn parse_capture<T>(caps: &Captures, index: usize) -> Result<T, String>
where
T: FromStr,
<T as FromStr>::Err: Display,
{
let capture_match = caps
.get(index)
.ok_or_else(|| format!("Failed to get capture match at index {index}"))?
.as_str();
capture_match.parse().map_err(|e| {
format!(
"Failed to parse {} from {capture_match}: {e}",
std::any::type_name::<T>()
)
})
}

View File

@@ -4,7 +4,7 @@ version = "0.1.0"
edition = "2021"
[dependencies]
prometheus = {version = "0.13", default_features=false} # removes protobuf dependency
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
libc = "0.2"
lazy_static = "1.4"
once_cell = "1.8.0"

View File

@@ -3,7 +3,6 @@
//! Otherwise, we might not see all metrics registered via
//! a default registry.
use lazy_static::lazy_static;
use once_cell::race::OnceBox;
pub use prometheus::{exponential_buckets, linear_buckets};
pub use prometheus::{register_gauge, Gauge};
pub use prometheus::{register_gauge_vec, GaugeVec};
@@ -27,48 +26,15 @@ pub fn gather() -> Vec<prometheus::proto::MetricFamily> {
prometheus::gather()
}
static COMMON_METRICS_PREFIX: OnceBox<&str> = OnceBox::new();
/// Sets a prefix which will be used for all common metrics, typically a service
/// name like 'pageserver'. Should be executed exactly once in the beginning of
/// any executable which uses common metrics.
pub fn set_common_metrics_prefix(prefix: &'static str) {
// Not unwrap() because metrics may be initialized after multiple threads have been started.
COMMON_METRICS_PREFIX
.set(prefix.into())
.unwrap_or_else(|_| {
eprintln!(
"set_common_metrics_prefix() was called second time with '{}', exiting",
prefix
);
std::process::exit(1);
});
}
/// Prepends a prefix to a common metric name so they are distinguished between
/// different services, see <https://github.com/zenithdb/zenith/pull/681>
/// A call to set_common_metrics_prefix() is necessary prior to calling this.
pub fn new_common_metric_name(unprefixed_metric_name: &str) -> String {
// Not unwrap() because metrics may be initialized after multiple threads have been started.
format!(
"{}_{}",
COMMON_METRICS_PREFIX.get().unwrap_or_else(|| {
eprintln!("set_common_metrics_prefix() was not called, but metrics are used, exiting");
std::process::exit(1);
}),
unprefixed_metric_name
)
}
lazy_static! {
static ref DISK_IO_BYTES: IntGaugeVec = register_int_gauge_vec!(
new_common_metric_name("disk_io_bytes"),
"libmetrics_disk_io_bytes_total",
"Bytes written and read from disk, grouped by the operation (read|write)",
&["io_operation"]
)
.expect("Failed to register disk i/o bytes int gauge vec");
static ref MAXRSS_KB: IntGauge = register_int_gauge!(
new_common_metric_name("maxrss_kb"),
"libmetrics_maxrss_kb",
"Memory usage (Maximum Resident Set Size)"
)
.expect("Failed to register maxrss_kb int gauge");

View File

@@ -20,5 +20,10 @@ serde = { version = "1.0", features = ["derive"] }
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[dev-dependencies]
env_logger = "0.9"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
wal_generate = { path = "wal_generate" }
[build-dependencies]
bindgen = "0.59.1"

View File

@@ -8,6 +8,7 @@
#![allow(deref_nullptr)]
use serde::{Deserialize, Serialize};
use utils::lsn::Lsn;
include!(concat!(env!("OUT_DIR"), "/bindings.rs"));
@@ -37,3 +38,21 @@ pub const fn transaction_id_precedes(id1: TransactionId, id2: TransactionId) ->
let diff = id1.wrapping_sub(id2) as i32;
diff < 0
}
// Check if page is not yet initialized (port of Postgres PageIsInit() macro)
pub fn page_is_new(pg: &[u8]) -> bool {
pg[14] == 0 && pg[15] == 0 // pg_upper == 0
}
// ExtractLSN from page header
pub fn page_get_lsn(pg: &[u8]) -> Lsn {
Lsn(
((u32::from_le_bytes(pg[0..4].try_into().unwrap()) as u64) << 32)
| u32::from_le_bytes(pg[4..8].try_into().unwrap()) as u64,
)
}
pub fn page_set_lsn(pg: &mut [u8], lsn: Lsn) {
pg[0..4].copy_from_slice(&((lsn.0 >> 32) as u32).to_le_bytes());
pg[4..8].copy_from_slice(&(lsn.0 as u32).to_le_bytes());
}

View File

@@ -89,7 +89,12 @@ impl WalStreamDecoder {
return Ok(None);
}
let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf);
let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| {
WalDecodeError {
msg: format!("long header deserialization failed {}", e),
lsn: self.lsn,
}
})?;
if hdr.std.xlp_pageaddr != self.lsn.0 {
return Err(WalDecodeError {
@@ -106,7 +111,12 @@ impl WalStreamDecoder {
return Ok(None);
}
let hdr = XLogPageHeaderData::from_bytes(&mut self.inputbuf);
let hdr = XLogPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| {
WalDecodeError {
msg: format!("header deserialization failed {}", e),
lsn: self.lsn,
}
})?;
if hdr.xlp_pageaddr != self.lsn.0 {
return Err(WalDecodeError {
@@ -188,7 +198,13 @@ impl WalStreamDecoder {
}
// We now have a record in the 'recordbuf' local variable.
let xlogrec = XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]);
let xlogrec =
XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]).map_err(|e| {
WalDecodeError {
msg: format!("xlog record deserialization failed {}", e),
lsn: self.lsn,
}
})?;
let mut crc = 0;
crc = crc32c_append(crc, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);

View File

@@ -15,7 +15,7 @@ use crate::XLogPageHeaderData;
use crate::XLogRecord;
use crate::XLOG_PAGE_MAGIC;
use anyhow::{bail, Result};
use anyhow::{bail, ensure};
use byteorder::{ByteOrder, LittleEndian};
use bytes::BytesMut;
use bytes::{Buf, Bytes};
@@ -28,6 +28,9 @@ use std::io::prelude::*;
use std::io::SeekFrom;
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use utils::bin_ser::DeserializeError;
use utils::bin_ser::SerializeError;
use utils::const_assert;
use utils::lsn::Lsn;
pub const XLOG_FNAME_LEN: usize = 24;
@@ -118,11 +121,15 @@ pub fn normalize_lsn(lsn: Lsn, seg_sz: usize) -> Lsn {
}
pub fn get_current_timestamp() -> TimestampTz {
to_pg_timestamp(SystemTime::now())
}
pub fn to_pg_timestamp(time: SystemTime) -> TimestampTz {
const UNIX_EPOCH_JDATE: u64 = 2440588; /* == date2j(1970, 1, 1) */
const POSTGRES_EPOCH_JDATE: u64 = 2451545; /* == date2j(2000, 1, 1) */
const SECS_PER_DAY: u64 = 86400;
const USECS_PER_SEC: u64 = 1000000;
match SystemTime::now().duration_since(SystemTime::UNIX_EPOCH) {
match time.duration_since(SystemTime::UNIX_EPOCH) {
Ok(n) => {
((n.as_secs() - ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY))
* USECS_PER_SEC
@@ -140,11 +147,12 @@ fn find_end_of_wal_segment(
tli: TimeLineID,
wal_seg_size: usize,
start_offset: usize, // start reading at this point
) -> Result<u32> {
) -> anyhow::Result<u32> {
// step back to the beginning of the page to read it in...
let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ;
let mut skipping_first_contrecord: bool = false;
let mut contlen: usize = 0;
let mut wal_crc: u32 = 0;
let mut xl_crc: u32 = 0;
let mut crc: u32 = 0;
let mut rec_offs: usize = 0;
let mut buf = [0u8; XLOG_BLCKSZ];
@@ -152,11 +160,15 @@ fn find_end_of_wal_segment(
let mut last_valid_rec_pos: usize = start_offset; // assume at given start_offset begins new record
let mut file = File::open(data_dir.join(file_name.clone() + ".partial")).unwrap();
file.seek(SeekFrom::Start(offs as u64))?;
// xl_crc is the last field in XLogRecord, will not be read into rec_hdr
const_assert!(XLOG_RECORD_CRC_OFFS + 4 == XLOG_SIZE_OF_XLOG_RECORD);
let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];
trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset);
while offs < wal_seg_size {
// we are at the beginning of the page; read it in
if offs % XLOG_BLCKSZ == 0 {
trace!("offs=0x{:x}: new page", offs);
let bytes_read = file.read(&mut buf)?;
if bytes_read != buf.len() {
bail!(
@@ -170,30 +182,49 @@ fn find_end_of_wal_segment(
let xlp_magic = LittleEndian::read_u16(&buf[0..2]);
let xlp_info = LittleEndian::read_u16(&buf[2..4]);
let xlp_rem_len = LittleEndian::read_u32(&buf[XLP_REM_LEN_OFFS..XLP_REM_LEN_OFFS + 4]);
trace!(
" xlp_magic=0x{:x}, xlp_info=0x{:x}, xlp_rem_len={}",
xlp_magic,
xlp_info,
xlp_rem_len
);
// this is expected in current usage when valid WAL starts after page header
if xlp_magic != XLOG_PAGE_MAGIC as u16 {
trace!(
"invalid WAL file {}.partial magic {} at {:?}",
" invalid WAL file {}.partial magic {} at {:?}",
file_name,
xlp_magic,
Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)),
);
}
if offs == 0 {
offs = XLOG_SIZE_OF_XLOG_LONG_PHD;
offs += XLOG_SIZE_OF_XLOG_LONG_PHD;
if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 {
offs += ((xlp_rem_len + 7) & !7) as usize;
trace!(" first record is contrecord");
skipping_first_contrecord = true;
contlen = xlp_rem_len as usize;
if offs < start_offset {
// Pre-condition failed: the beginning of the segment is unexpectedly corrupted.
ensure!(start_offset - offs >= contlen,
"start_offset is in the middle of the first record (which happens to be a contrecord), \
expected to be on a record boundary. Is beginning of the segment corrupted?");
contlen = 0;
// keep skipping_first_contrecord to avoid counting the contrecord as valid, we did not check it.
}
} else {
trace!(" first record is not contrecord");
}
} else {
offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
}
// ... and step forward again if asked
trace!(" skipped header to 0x{:x}", offs);
offs = max(offs, start_offset);
// beginning of the next record
} else if contlen == 0 {
let page_offs = offs % XLOG_BLCKSZ;
let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len);
if xl_tot_len == 0 {
info!(
"find_end_of_wal_segment reached zeros at {:?}, last records ends at {:?}",
@@ -206,10 +237,25 @@ fn find_end_of_wal_segment(
);
break; // zeros, reached the end
}
last_valid_rec_pos = offs;
if skipping_first_contrecord {
skipping_first_contrecord = false;
trace!(" first contrecord has been just completed");
} else {
trace!(
" updating last_valid_rec_pos: 0x{:x} --> 0x{:x}",
last_valid_rec_pos,
offs
);
last_valid_rec_pos = offs;
}
offs += 4;
rec_offs = 4;
contlen = xl_tot_len - 4;
trace!(
" reading rec_hdr[0..4] <-- [0x{:x}; 0x{:x})",
page_offs,
page_offs + 4
);
rec_hdr[0..4].copy_from_slice(&buf[page_offs..page_offs + 4]);
} else {
// we're continuing a record, possibly from previous page.
@@ -218,42 +264,118 @@ fn find_end_of_wal_segment(
// read the rest of the record, or as much as fits on this page.
let n = min(contlen, pageleft);
// fill rec_hdr (header up to (but not including) xl_crc field)
trace!(
"offs=0x{:x}, record continuation, pageleft={}, contlen={}",
offs,
pageleft,
contlen
);
// fill rec_hdr header up to (but not including) xl_crc field
trace!(
" rec_offs={}, XLOG_RECORD_CRC_OFFS={}, XLOG_SIZE_OF_XLOG_RECORD={}",
rec_offs,
XLOG_RECORD_CRC_OFFS,
XLOG_SIZE_OF_XLOG_RECORD
);
if rec_offs < XLOG_RECORD_CRC_OFFS {
let len = min(XLOG_RECORD_CRC_OFFS - rec_offs, n);
trace!(
" reading rec_hdr[{}..{}] <-- [0x{:x}; 0x{:x})",
rec_offs,
rec_offs + len,
page_offs,
page_offs + len
);
rec_hdr[rec_offs..rec_offs + len].copy_from_slice(&buf[page_offs..page_offs + len]);
}
if rec_offs <= XLOG_RECORD_CRC_OFFS && rec_offs + n >= XLOG_SIZE_OF_XLOG_RECORD {
let crc_offs = page_offs - rec_offs + XLOG_RECORD_CRC_OFFS;
wal_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]);
// All records are aligned on 8-byte boundary, so their 8-byte frames
// cannot be split between pages. As xl_crc is the last field,
// its content is always on the same page.
const_assert!(XLOG_RECORD_CRC_OFFS % 8 == 4);
// We should always start reading aligned records even in incorrect WALs so if
// the condition is false it is likely a bug. However, it is localized somewhere
// in this function, hence we do not crash and just report failure instead.
ensure!(crc_offs % 8 == 4, "Record is not aligned properly (bug?)");
xl_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]);
trace!(
" reading xl_crc: [0x{:x}; 0x{:x}) = 0x{:x}",
crc_offs,
crc_offs + 4,
xl_crc
);
crc = crc32c_append(0, &buf[crc_offs + 4..page_offs + n]);
} else {
crc ^= 0xFFFFFFFFu32;
trace!(
" initializing crc: [0x{:x}; 0x{:x}); crc = 0x{:x}",
crc_offs + 4,
page_offs + n,
crc
);
} else if rec_offs > XLOG_RECORD_CRC_OFFS {
// As all records are 8-byte aligned, the header is already fully read and `crc` is initialized in the branch above.
ensure!(rec_offs >= XLOG_SIZE_OF_XLOG_RECORD);
let old_crc = crc;
crc = crc32c_append(crc, &buf[page_offs..page_offs + n]);
trace!(
" appending to crc: [0x{:x}; 0x{:x}); 0x{:x} --> 0x{:x}",
page_offs,
page_offs + n,
old_crc,
crc
);
} else {
// Correct because of the way conditions are written above.
assert!(rec_offs + n < XLOG_SIZE_OF_XLOG_RECORD);
// If `skipping_first_contrecord == true`, we may be reading from a middle of a record
// which started in the previous segment. Hence there is no point in validating the header.
if !skipping_first_contrecord && rec_offs + n > XLOG_RECORD_CRC_OFFS {
info!(
"Curiously corrupted WAL: a record stops inside the header; \
offs=0x{:x}, record continuation, pageleft={}, contlen={}",
offs, pageleft, contlen
);
break;
}
// Do nothing: we are still reading the header. It's accounted in CRC in the end of the record.
}
crc = !crc;
rec_offs += n;
offs += n;
contlen -= n;
if contlen == 0 {
crc = !crc;
trace!(" record completed at 0x{:x}", offs);
crc = crc32c_append(crc, &rec_hdr);
offs = (offs + 7) & !7; // pad on 8 bytes boundary */
if crc == wal_crc {
trace!(
" padded offs to 0x{:x}, crc is {:x}, expected crc is {:x}",
offs,
crc,
xl_crc
);
if skipping_first_contrecord {
// do nothing, the flag will go down on next iteration when we're reading new record
trace!(" first conrecord has been just completed");
} else if crc == xl_crc {
// record is valid, advance the result to its end (with
// alignment to the next record taken into account)
trace!(
" updating last_valid_rec_pos: 0x{:x} --> 0x{:x}",
last_valid_rec_pos,
offs
);
last_valid_rec_pos = offs;
} else {
info!(
"CRC mismatch {} vs {} at {}",
crc, wal_crc, last_valid_rec_pos
crc, xl_crc, last_valid_rec_pos
);
break;
}
}
}
}
trace!("last_valid_rec_pos=0x{:x}", last_valid_rec_pos);
Ok(last_valid_rec_pos as u32)
}
@@ -268,7 +390,7 @@ pub fn find_end_of_wal(
wal_seg_size: usize,
precise: bool,
start_lsn: Lsn, // start reading WAL at this point or later
) -> Result<(XLogRecPtr, TimeLineID)> {
) -> anyhow::Result<(XLogRecPtr, TimeLineID)> {
let mut high_segno: XLogSegNo = 0;
let mut high_tli: TimeLineID = 0;
let mut high_ispartial = false;
@@ -350,19 +472,19 @@ pub fn main() {
}
impl XLogRecord {
pub fn from_slice(buf: &[u8]) -> XLogRecord {
pub fn from_slice(buf: &[u8]) -> Result<XLogRecord, DeserializeError> {
use utils::bin_ser::LeSer;
XLogRecord::des(buf).unwrap()
XLogRecord::des(buf)
}
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogRecord {
pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogRecord, DeserializeError> {
use utils::bin_ser::LeSer;
XLogRecord::des_from(&mut buf.reader()).unwrap()
XLogRecord::des_from(&mut buf.reader())
}
pub fn encode(&self) -> Bytes {
pub fn encode(&self) -> Result<Bytes, SerializeError> {
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
Ok(self.ser()?.into())
}
// Is this record an XLOG_SWITCH record? They need some special processing,
@@ -372,35 +494,35 @@ impl XLogRecord {
}
impl XLogPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogPageHeaderData, DeserializeError> {
use utils::bin_ser::LeSer;
XLogPageHeaderData::des_from(&mut buf.reader()).unwrap()
XLogPageHeaderData::des_from(&mut buf.reader())
}
}
impl XLogLongPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogLongPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogLongPageHeaderData, DeserializeError> {
use utils::bin_ser::LeSer;
XLogLongPageHeaderData::des_from(&mut buf.reader()).unwrap()
XLogLongPageHeaderData::des_from(&mut buf.reader())
}
pub fn encode(&self) -> Bytes {
pub fn encode(&self) -> Result<Bytes, SerializeError> {
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
self.ser().map(|b| b.into())
}
}
pub const SIZEOF_CHECKPOINT: usize = std::mem::size_of::<CheckPoint>();
impl CheckPoint {
pub fn encode(&self) -> Bytes {
pub fn encode(&self) -> Result<Bytes, SerializeError> {
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
Ok(self.ser()?.into())
}
pub fn decode(buf: &[u8]) -> Result<CheckPoint, anyhow::Error> {
pub fn decode(buf: &[u8]) -> Result<CheckPoint, DeserializeError> {
use utils::bin_ser::LeSer;
Ok(CheckPoint::des(buf)?)
CheckPoint::des(buf)
}
/// Update next XID based on provided new_xid and stored epoch.
@@ -438,7 +560,7 @@ impl CheckPoint {
// Generate new, empty WAL segment.
// We need this segment to start compute node.
//
pub fn generate_wal_segment(segno: u64, system_id: u64) -> Bytes {
pub fn generate_wal_segment(segno: u64, system_id: u64) -> Result<Bytes, SerializeError> {
let mut seg_buf = BytesMut::with_capacity(pg_constants::WAL_SEGMENT_SIZE as usize);
let pageaddr = XLogSegNoOffsetToRecPtr(segno, 0, pg_constants::WAL_SEGMENT_SIZE);
@@ -458,90 +580,138 @@ pub fn generate_wal_segment(segno: u64, system_id: u64) -> Bytes {
xlp_xlog_blcksz: XLOG_BLCKSZ as u32,
};
let hdr_bytes = hdr.encode();
let hdr_bytes = hdr.encode()?;
seg_buf.extend_from_slice(&hdr_bytes);
//zero out the rest of the file
seg_buf.resize(pg_constants::WAL_SEGMENT_SIZE, 0);
seg_buf.freeze()
Ok(seg_buf.freeze())
}
#[cfg(test)]
mod tests {
use super::*;
use regex::Regex;
use std::{env, process::Command, str::FromStr};
use std::{env, str::FromStr};
// Run find_end_of_wal against file in test_wal dir
// Ensure that it finds last record correctly
#[test]
pub fn test_find_end_of_wal() {
// 1. Run initdb to generate some WAL
fn init_logging() {
let _ = env_logger::Builder::from_env(
env_logger::Env::default()
.default_filter_or("wal_generate=info,postgres_ffi::xlog_utils=trace"),
)
.is_test(true)
.try_init();
}
fn test_end_of_wal(
test_name: &str,
generate_wal: impl Fn(&mut postgres::Client) -> anyhow::Result<postgres::types::PgLsn>,
expected_end_of_wal_non_partial: Lsn,
last_segment: &str,
) {
use wal_generate::*;
// 1. Generate some WAL
let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("..");
let data_dir = top_path.join("test_output/test_find_end_of_wal");
let initdb_path = top_path.join("tmp_install/bin/initdb");
let lib_path = top_path.join("tmp_install/lib");
if data_dir.exists() {
fs::remove_dir_all(&data_dir).unwrap();
let cfg = Conf {
pg_distrib_dir: top_path.join("tmp_install"),
datadir: top_path.join(format!("test_output/{}", test_name)),
};
if cfg.datadir.exists() {
fs::remove_dir_all(&cfg.datadir).unwrap();
}
println!("Using initdb from '{}'", initdb_path.display());
println!("Data directory '{}'", data_dir.display());
let initdb_output = Command::new(initdb_path)
.args(&["-D", data_dir.to_str().unwrap()])
.arg("--no-instructions")
.arg("--no-sync")
.env_clear()
.env("LD_LIBRARY_PATH", &lib_path)
.env("DYLD_LIBRARY_PATH", &lib_path)
.output()
.unwrap();
assert!(
initdb_output.status.success(),
"initdb failed. Status: '{}', stdout: '{}', stderr: '{}'",
initdb_output.status,
String::from_utf8_lossy(&initdb_output.stdout),
String::from_utf8_lossy(&initdb_output.stderr),
);
cfg.initdb().unwrap();
let mut srv = cfg.start_server().unwrap();
let expected_wal_end: Lsn =
u64::from(generate_wal(&mut srv.connect_with_timeout().unwrap()).unwrap()).into();
srv.kill();
// 2. Pick WAL generated by initdb
let wal_dir = data_dir.join("pg_wal");
let wal_dir = cfg.datadir.join("pg_wal");
let wal_seg_size = 16 * 1024 * 1024;
// 3. Check end_of_wal on non-partial WAL segment (we treat it as fully populated)
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap();
let wal_end = Lsn(wal_end);
println!("wal_end={}, tli={}", wal_end, tli);
assert_eq!(wal_end, "0/2000000".parse::<Lsn>().unwrap());
info!(
"find_end_of_wal returned (wal_end={}, tli={})",
wal_end, tli
);
assert_eq!(wal_end, expected_end_of_wal_non_partial);
// 4. Get the actual end of WAL by pg_waldump
let waldump_path = top_path.join("tmp_install/bin/pg_waldump");
let waldump_output = Command::new(waldump_path)
.arg(wal_dir.join("000000010000000000000001"))
.env_clear()
.env("LD_LIBRARY_PATH", &lib_path)
.env("DYLD_LIBRARY_PATH", &lib_path)
.output()
.unwrap();
let waldump_output = std::str::from_utf8(&waldump_output.stderr).unwrap();
println!("waldump_output = '{}'", &waldump_output);
let re = Regex::new(r"invalid record length at (.+):").unwrap();
let caps = re.captures(waldump_output).unwrap();
let waldump_output = cfg
.pg_waldump("000000010000000000000001", last_segment)
.unwrap()
.stderr;
let waldump_output = std::str::from_utf8(&waldump_output).unwrap();
let caps = match Regex::new(r"invalid record length at (.+):")
.unwrap()
.captures(waldump_output)
{
Some(caps) => caps,
None => {
error!("Unable to parse pg_waldump's stderr:\n{}", waldump_output);
panic!();
}
};
let waldump_wal_end = Lsn::from_str(caps.get(1).unwrap().as_str()).unwrap();
info!(
"waldump erred on {}, expected wal end at {}",
waldump_wal_end, expected_wal_end
);
assert_eq!(waldump_wal_end, expected_wal_end);
// 5. Rename file to partial to actually find last valid lsn
fs::rename(
wal_dir.join("000000010000000000000001"),
wal_dir.join("000000010000000000000001.partial"),
wal_dir.join(last_segment),
wal_dir.join(format!("{}.partial", last_segment)),
)
.unwrap();
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap();
let wal_end = Lsn(wal_end);
println!("wal_end={}, tli={}", wal_end, tli);
info!(
"find_end_of_wal returned (wal_end={}, tli={})",
wal_end, tli
);
assert_eq!(wal_end, waldump_wal_end);
}
#[test]
pub fn test_find_end_of_wal_simple() {
init_logging();
test_end_of_wal(
"test_find_end_of_wal_simple",
wal_generate::generate_simple,
"0/2000000".parse::<Lsn>().unwrap(),
"000000010000000000000001",
);
}
#[test]
pub fn test_find_end_of_wal_crossing_segment_followed_by_small_one() {
init_logging();
test_end_of_wal(
"test_find_end_of_wal_crossing_segment_followed_by_small_one",
wal_generate::generate_wal_record_crossing_segment_followed_by_small_one,
"0/3000000".parse::<Lsn>().unwrap(),
"000000010000000000000002",
);
}
#[test]
#[ignore = "not yet fixed, needs correct parsing of pre-last segments"] // TODO
pub fn test_find_end_of_wal_last_crossing_segment() {
init_logging();
test_end_of_wal(
"test_find_end_of_wal_last_crossing_segment",
wal_generate::generate_last_wal_record_crossing_segment,
"0/3000000".parse::<Lsn>().unwrap(),
"000000010000000000000002",
);
}
/// Check the math in update_next_xid
///
/// NOTE: These checks are sensitive to the value of XID_CHECKPOINT_INTERVAL,

View File

@@ -0,0 +1,14 @@
[package]
name = "wal_generate"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow = "1.0"
clap = "3.0"
env_logger = "0.9"
log = "0.4"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
tempfile = "3.2"

View File

@@ -0,0 +1,58 @@
use anyhow::*;
use clap::{App, Arg};
use wal_generate::*;
fn main() -> Result<()> {
env_logger::Builder::from_env(
env_logger::Env::default().default_filter_or("wal_generate=info"),
)
.init();
let arg_matches = App::new("Postgres WAL generator")
.about("Generates Postgres databases with specific WAL properties")
.arg(
Arg::new("datadir")
.short('D')
.long("datadir")
.takes_value(true)
.help("Data directory for the Postgres server")
.required(true)
)
.arg(
Arg::new("pg-distrib-dir")
.long("pg-distrib-dir")
.takes_value(true)
.help("Directory with Postgres distribution (bin and lib directories, e.g. tmp_install)")
.default_value("/usr/local")
)
.arg(
Arg::new("type")
.long("type")
.takes_value(true)
.help("Type of WAL to generate")
.possible_values(["simple", "last_wal_record_crossing_segment", "wal_record_crossing_segment_followed_by_small_one"])
.required(true)
)
.get_matches();
let cfg = Conf {
pg_distrib_dir: arg_matches.value_of("pg-distrib-dir").unwrap().into(),
datadir: arg_matches.value_of("datadir").unwrap().into(),
};
cfg.initdb()?;
let mut srv = cfg.start_server()?;
let lsn = match arg_matches.value_of("type").unwrap() {
"simple" => generate_simple(&mut srv.connect_with_timeout()?)?,
"last_wal_record_crossing_segment" => {
generate_last_wal_record_crossing_segment(&mut srv.connect_with_timeout()?)?
}
"wal_record_crossing_segment_followed_by_small_one" => {
generate_wal_record_crossing_segment_followed_by_small_one(
&mut srv.connect_with_timeout()?,
)?
}
a => panic!("Unknown --type argument: {}", a),
};
println!("end_of_wal = {}", lsn);
srv.kill();
Ok(())
}

View File

@@ -0,0 +1,278 @@
use anyhow::*;
use core::time::Duration;
use log::*;
use postgres::types::PgLsn;
use postgres::Client;
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use std::time::Instant;
use tempfile::{tempdir, TempDir};
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Conf {
pub pg_distrib_dir: PathBuf,
pub datadir: PathBuf,
}
pub struct PostgresServer {
process: std::process::Child,
_unix_socket_dir: TempDir,
client_config: postgres::Config,
}
impl Conf {
fn pg_bin_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("bin")
}
fn pg_lib_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("lib")
}
fn new_pg_command(&self, command: impl AsRef<Path>) -> Result<Command> {
let path = self.pg_bin_dir().join(command);
ensure!(path.exists(), "Command {:?} does not exist", path);
let mut cmd = Command::new(path);
cmd.env_clear()
.env("LD_LIBRARY_PATH", self.pg_lib_dir())
.env("DYLD_LIBRARY_PATH", self.pg_lib_dir());
Ok(cmd)
}
pub fn initdb(&self) -> Result<()> {
if let Some(parent) = self.datadir.parent() {
info!("Pre-creating parent directory {:?}", parent);
// Tests may be run concurrently and there may be a race to create `test_output/`.
// std::fs::create_dir_all is guaranteed to have no races with another thread creating directories.
std::fs::create_dir_all(parent)?;
}
info!(
"Running initdb in {:?} with user \"postgres\"",
self.datadir
);
let output = self
.new_pg_command("initdb")?
.arg("-D")
.arg(self.datadir.as_os_str())
.args(&["-U", "postgres", "--no-instructions", "--no-sync"])
.output()?;
debug!("initdb output: {:?}", output);
ensure!(
output.status.success(),
"initdb failed, stdout and stderr follow:\n{}{}",
String::from_utf8_lossy(&output.stdout),
String::from_utf8_lossy(&output.stderr),
);
Ok(())
}
pub fn start_server(&self) -> Result<PostgresServer> {
info!("Starting Postgres server in {:?}", self.datadir);
let unix_socket_dir = tempdir()?; // We need a directory with a short name for Unix socket (up to 108 symbols)
let unix_socket_dir_path = unix_socket_dir.path().to_owned();
let server_process = self
.new_pg_command("postgres")?
.args(&["-c", "listen_addresses="])
.arg("-k")
.arg(unix_socket_dir_path.as_os_str())
.arg("-D")
.arg(self.datadir.as_os_str())
.args(&["-c", "wal_keep_size=50MB"]) // Ensure old WAL is not removed
.args(&["-c", "logging_collector=on"]) // stderr will mess up with tests output
.args(&["-c", "shared_preload_libraries=zenith"]) // can only be loaded at startup
// Disable background processes as much as possible
.args(&["-c", "wal_writer_delay=10s"])
.args(&["-c", "autovacuum=off"])
.stderr(Stdio::null())
.spawn()?;
let server = PostgresServer {
process: server_process,
_unix_socket_dir: unix_socket_dir,
client_config: {
let mut c = postgres::Config::new();
c.host_path(&unix_socket_dir_path);
c.user("postgres");
c.connect_timeout(Duration::from_millis(1000));
c
},
};
Ok(server)
}
pub fn pg_waldump(
&self,
first_segment_name: &str,
last_segment_name: &str,
) -> Result<std::process::Output> {
let first_segment_file = self.datadir.join(first_segment_name);
let last_segment_file = self.datadir.join(last_segment_name);
info!(
"Running pg_waldump for {} .. {}",
first_segment_file.display(),
last_segment_file.display()
);
let output = self
.new_pg_command("pg_waldump")?
.args(&[
&first_segment_file.as_os_str(),
&last_segment_file.as_os_str(),
])
.output()?;
debug!("waldump output: {:?}", output);
Ok(output)
}
}
impl PostgresServer {
pub fn connect_with_timeout(&self) -> Result<Client> {
let retry_until = Instant::now() + *self.client_config.get_connect_timeout().unwrap();
while Instant::now() < retry_until {
use std::result::Result::Ok;
if let Ok(client) = self.client_config.connect(postgres::NoTls) {
return Ok(client);
}
std::thread::sleep(Duration::from_millis(100));
}
bail!("Connection timed out");
}
pub fn kill(&mut self) {
self.process.kill().unwrap();
self.process.wait().unwrap();
}
}
impl Drop for PostgresServer {
fn drop(&mut self) {
use std::result::Result::Ok;
match self.process.try_wait() {
Ok(Some(_)) => return,
Ok(None) => {
warn!("Server was not terminated, will be killed");
}
Err(e) => {
error!("Unable to get status of the server: {}, will be killed", e);
}
}
let _ = self.process.kill();
}
}
pub trait PostgresClientExt: postgres::GenericClient {
fn pg_current_wal_insert_lsn(&mut self) -> Result<PgLsn> {
Ok(self
.query_one("SELECT pg_current_wal_insert_lsn()", &[])?
.get(0))
}
fn pg_current_wal_flush_lsn(&mut self) -> Result<PgLsn> {
Ok(self
.query_one("SELECT pg_current_wal_flush_lsn()", &[])?
.get(0))
}
}
impl<C: postgres::GenericClient> PostgresClientExt for C {}
fn generate_internal<C: postgres::GenericClient>(
client: &mut C,
f: impl Fn(&mut C, PgLsn) -> Result<Option<PgLsn>>,
) -> Result<PgLsn> {
client.execute("create extension if not exists zenith_test_utils", &[])?;
let wal_segment_size = client.query_one(
"select cast(setting as bigint) as setting, unit \
from pg_settings where name = 'wal_segment_size'",
&[],
)?;
ensure!(
wal_segment_size.get::<_, String>("unit") == "B",
"Unexpected wal_segment_size unit"
);
ensure!(
wal_segment_size.get::<_, i64>("setting") == 16 * 1024 * 1024,
"Unexpected wal_segment_size in bytes"
);
let initial_lsn = client.pg_current_wal_insert_lsn()?;
info!("LSN initial = {}", initial_lsn);
let last_lsn = match f(client, initial_lsn)? {
None => client.pg_current_wal_insert_lsn()?,
Some(last_lsn) => match last_lsn.cmp(&client.pg_current_wal_insert_lsn()?) {
Ordering::Less => bail!("Some records were inserted after the generated WAL"),
Ordering::Equal => last_lsn,
Ordering::Greater => bail!("Reported LSN is greater than insert_lsn"),
},
};
// Some records may be not flushed, e.g. non-transactional logical messages.
client.execute("select neon_xlogflush(pg_current_wal_insert_lsn())", &[])?;
match last_lsn.cmp(&client.pg_current_wal_flush_lsn()?) {
Ordering::Less => bail!("Some records were flushed after the generated WAL"),
Ordering::Equal => {}
Ordering::Greater => bail!("Reported LSN is greater than flush_lsn"),
}
Ok(last_lsn)
}
pub fn generate_simple(client: &mut impl postgres::GenericClient) -> Result<PgLsn> {
generate_internal(client, |client, _| {
client.execute("CREATE table t(x int)", &[])?;
Ok(None)
})
}
fn generate_single_logical_message(
client: &mut impl postgres::GenericClient,
transactional: bool,
) -> Result<PgLsn> {
generate_internal(client, |client, initial_lsn| {
ensure!(
initial_lsn < PgLsn::from(0x0200_0000 - 1024 * 1024),
"Initial LSN is too far in the future"
);
let message_lsn: PgLsn = client
.query_one(
"select pg_logical_emit_message($1, 'big-16mb-msg', \
concat(repeat('abcd', 16 * 256 * 1024), 'end')) as message_lsn",
&[&transactional],
)?
.get("message_lsn");
ensure!(
message_lsn > PgLsn::from(0x0200_0000 + 4 * 8192),
"Logical message did not cross the segment boundary"
);
ensure!(
message_lsn < PgLsn::from(0x0400_0000),
"Logical message crossed two segments"
);
if transactional {
// Transactional logical messages are part of a transaction, so the one above is
// followed by a small COMMIT record.
let after_message_lsn = client.pg_current_wal_insert_lsn()?;
ensure!(
message_lsn < after_message_lsn,
"No record found after the emitted message"
);
Ok(Some(after_message_lsn))
} else {
Ok(Some(message_lsn))
}
})
}
pub fn generate_wal_record_crossing_segment_followed_by_small_one(
client: &mut impl postgres::GenericClient,
) -> Result<PgLsn> {
generate_single_logical_message(client, true)
}
pub fn generate_last_wal_record_crossing_segment<C: postgres::GenericClient>(
client: &mut C,
) -> Result<PgLsn> {
generate_single_logical_message(client, false)
}

View File

@@ -0,0 +1,20 @@
[package]
name = "remote_storage"
version = "0.1.0"
edition = "2021"
[dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
tokio = { version = "1.17", features = ["sync", "macros", "fs", "io-util"] }
tokio-util = { version = "0.7", features = ["io"] }
tracing = "0.1.27"
rusoto_core = "0.48"
rusoto_s3 = "0.48"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
async-trait = "0.1"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[dev-dependencies]
tempfile = "3.2"

View File

@@ -0,0 +1,233 @@
//! A set of generic storage abstractions for the page server to use when backing up and restoring its state from the external storage.
//! No other modules from this tree are supposed to be used directly by the external code.
//!
//! [`RemoteStorage`] trait a CRUD-like generic abstraction to use for adapting external storages with a few implementations:
//! * [`local_fs`] allows to use local file system as an external storage
//! * [`s3_bucket`] uses AWS S3 bucket as an external storage
//!
mod local_fs;
mod s3_bucket;
use std::{
borrow::Cow,
collections::HashMap,
ffi::OsStr,
num::{NonZeroU32, NonZeroUsize},
path::{Path, PathBuf},
};
use anyhow::Context;
use tokio::io;
use tracing::info;
pub use self::{
local_fs::LocalFs,
s3_bucket::{S3Bucket, S3ObjectKey},
};
/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed.
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS: usize = 50;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
/// Currently, sync happens with AWS S3, that has two limits on requests per second:
/// ~200 RPS for IAM services
/// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html
/// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests
/// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100;
/// Storage (potentially remote) API to manage its state.
/// This storage tries to be unaware of any layered repository context,
/// providing basic CRUD operations for storage files.
#[async_trait::async_trait]
pub trait RemoteStorage: Send + Sync {
/// A way to uniquely reference a file in the remote storage.
type RemoteObjectId;
/// Attempts to derive the storage path out of the local path, if the latter is correct.
fn remote_object_id(&self, local_path: &Path) -> anyhow::Result<Self::RemoteObjectId>;
/// Gets the download path of the given storage file.
fn local_path(&self, remote_object_id: &Self::RemoteObjectId) -> anyhow::Result<PathBuf>;
/// Lists all items the storage has right now.
async fn list(&self) -> anyhow::Result<Vec<Self::RemoteObjectId>>;
/// Streams the local file contents into remote into the remote storage entry.
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
// S3 PUT request requires the content length to be specified,
// otherwise it starts to fail with the concurrent connection count increasing.
from_size_bytes: usize,
to: &Self::RemoteObjectId,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()>;
/// Streams the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Returns the metadata, if any was stored with the file previously.
async fn download(
&self,
from: &Self::RemoteObjectId,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<Option<StorageMetadata>>;
/// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Returns the metadata, if any was stored with the file previously.
async fn download_byte_range(
&self,
from: &Self::RemoteObjectId,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<Option<StorageMetadata>>;
async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()>;
}
/// Every storage, currently supported.
/// Serves as a simple way to pass around the [`RemoteStorage`] without dealing with generics.
pub enum GenericRemoteStorage {
Local(LocalFs),
S3(S3Bucket),
}
impl GenericRemoteStorage {
pub fn new(
working_directory: PathBuf,
storage_config: &RemoteStorageConfig,
) -> anyhow::Result<Self> {
match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => {
info!("Using fs root '{}' as a remote storage", root.display());
LocalFs::new(root.clone(), working_directory).map(GenericRemoteStorage::Local)
}
RemoteStorageKind::AwsS3(s3_config) => {
info!("Using s3 bucket '{}' in region '{}' as a remote storage, prefix in bucket: '{:?}', bucket endpoint: '{:?}'",
s3_config.bucket_name, s3_config.bucket_region, s3_config.prefix_in_bucket, s3_config.endpoint);
S3Bucket::new(s3_config, working_directory).map(GenericRemoteStorage::S3)
}
}
}
}
/// Extra set of key-value pairs that contain arbitrary metadata about the storage entry.
/// Immutable, cannot be changed once the file is created.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StorageMetadata(HashMap<String, String>);
fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> {
if prefix == path {
anyhow::bail!(
"Prefix and the path are equal, cannot strip: '{}'",
prefix.display()
)
} else {
path.strip_prefix(prefix).with_context(|| {
format!(
"Path '{}' is not prefixed with '{}'",
path.display(),
prefix.display(),
)
})
}
}
/// External backup storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between the API user and the remote storage.
pub max_concurrent_syncs: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
pub storage: RemoteStorageKind,
}
/// A kind of a remote storage to connect to, with its connection configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored files into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all files in the S3 bucket
/// specified by the config
AwsS3(S3Config),
}
/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write).
#[derive(Clone, PartialEq, Eq)]
pub struct S3Config {
/// Name of the bucket to connect to.
pub bucket_name: String,
/// The region where the bucket is located at.
pub bucket_region: String,
/// A "subfolder" in the bucket, to use the same bucket separately by multiple remote storage users at once.
pub prefix_in_bucket: Option<String>,
/// A base URL to send S3 requests to.
/// By default, the endpoint is derived from a region name, assuming it's
/// an AWS S3 region name, erroring on wrong region name.
/// Endpoint provides a way to support other S3 flavors and their regions.
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,
/// AWS S3 has various limits on its API calls, we need not to exceed those.
/// See [`DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details.
pub concurrency_limit: NonZeroUsize,
}
impl std::fmt::Debug for S3Config {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("S3Config")
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.field("prefix_in_bucket", &self.prefix_in_bucket)
.field("concurrency_limit", &self.concurrency_limit)
.finish()
}
}
pub fn path_with_suffix_extension(original_path: impl AsRef<Path>, suffix: &str) -> PathBuf {
let new_extension = match original_path
.as_ref()
.extension()
.map(OsStr::to_string_lossy)
{
Some(extension) => Cow::Owned(format!("{extension}.{suffix}")),
None => Cow::Borrowed(suffix),
};
original_path
.as_ref()
.with_extension(new_extension.as_ref())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_path_with_suffix_extension() {
let p = PathBuf::from("/foo/bar");
assert_eq!(
&path_with_suffix_extension(&p, "temp").to_string_lossy(),
"/foo/bar.temp"
);
let p = PathBuf::from("/foo/bar");
assert_eq!(
&path_with_suffix_extension(&p, "temp.temp").to_string_lossy(),
"/foo/bar.temp.temp"
);
let p = PathBuf::from("/foo/bar.baz");
assert_eq!(
&path_with_suffix_extension(&p, "temp.temp").to_string_lossy(),
"/foo/bar.baz.temp.temp"
);
let p = PathBuf::from("/foo/bar.baz");
assert_eq!(
&path_with_suffix_extension(&p, ".temp").to_string_lossy(),
"/foo/bar.baz..temp"
);
}
}

View File

@@ -1,7 +1,7 @@
//! Local filesystem acting as a remote storage.
//! Multiple pageservers can use the same "storage" of this kind by using different storage roots.
//! Multiple API users can use the same "storage" of this kind by using different storage roots.
//!
//! This storage used in pageserver tests, but can also be used in cases when a certain persistent
//! This storage used in tests, but can also be used in cases when a certain persistent
//! volume is mounted to the local FS.
use std::{
@@ -17,16 +17,18 @@ use tokio::{
};
use tracing::*;
use crate::path_with_suffix_extension;
use super::{strip_path_prefix, RemoteStorage, StorageMetadata};
pub struct LocalFs {
pageserver_workdir: &'static Path,
root: PathBuf,
working_directory: PathBuf,
storage_root: PathBuf,
}
impl LocalFs {
/// Attempts to create local FS storage, along with its root directory.
pub fn new(root: PathBuf, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
pub fn new(root: PathBuf, working_directory: PathBuf) -> anyhow::Result<Self> {
if !root.exists() {
std::fs::create_dir_all(&root).with_context(|| {
format!(
@@ -36,15 +38,15 @@ impl LocalFs {
})?;
}
Ok(Self {
pageserver_workdir,
root,
working_directory,
storage_root: root,
})
}
fn resolve_in_storage(&self, path: &Path) -> anyhow::Result<PathBuf> {
if path.is_relative() {
Ok(self.root.join(path))
} else if path.starts_with(&self.root) {
Ok(self.storage_root.join(path))
} else if path.starts_with(&self.storage_root) {
Ok(path.to_path_buf())
} else {
bail!(
@@ -83,30 +85,30 @@ impl LocalFs {
#[async_trait::async_trait]
impl RemoteStorage for LocalFs {
type StoragePath = PathBuf;
type RemoteObjectId = PathBuf;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
Ok(self.root.join(
strip_path_prefix(self.pageserver_workdir, local_path)
fn remote_object_id(&self, local_path: &Path) -> anyhow::Result<Self::RemoteObjectId> {
Ok(self.storage_root.join(
strip_path_prefix(&self.working_directory, local_path)
.context("local path does not belong to this storage")?,
))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
let relative_path = strip_path_prefix(&self.root, storage_path)
fn local_path(&self, storage_path: &Self::RemoteObjectId) -> anyhow::Result<PathBuf> {
let relative_path = strip_path_prefix(&self.storage_root, storage_path)
.context("local path does not belong to this storage")?;
Ok(self.pageserver_workdir.join(relative_path))
Ok(self.working_directory.join(relative_path))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
get_all_files(&self.root).await
async fn list(&self) -> anyhow::Result<Vec<Self::RemoteObjectId>> {
get_all_files(&self.storage_root).await
}
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from_size_bytes: usize,
to: &Self::StoragePath,
to: &Self::RemoteObjectId,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()> {
let target_file_path = self.resolve_in_storage(to)?;
@@ -114,7 +116,7 @@ impl RemoteStorage for LocalFs {
// We need this dance with sort of durable rename (without fsyncs)
// to prevent partial uploads. This was really hit when pageserver shutdown
// cancelled the upload and partial file was left on the fs
let temp_file_path = path_with_suffix_extension(&target_file_path, ".temp");
let temp_file_path = path_with_suffix_extension(&target_file_path, "temp");
let mut destination = io::BufWriter::new(
fs::OpenOptions::new()
.write(true)
@@ -192,7 +194,7 @@ impl RemoteStorage for LocalFs {
async fn download(
&self,
from: &Self::StoragePath,
from: &Self::RemoteObjectId,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<Option<StorageMetadata>> {
let file_path = self.resolve_in_storage(from)?;
@@ -227,9 +229,9 @@ impl RemoteStorage for LocalFs {
}
}
async fn download_range(
async fn download_byte_range(
&self,
from: &Self::StoragePath,
from: &Self::RemoteObjectId,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
@@ -286,7 +288,7 @@ impl RemoteStorage for LocalFs {
}
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(path)?;
if file_path.exists() && file_path.is_file() {
Ok(fs::remove_file(file_path).await?)
@@ -299,15 +301,8 @@ impl RemoteStorage for LocalFs {
}
}
fn path_with_suffix_extension(original_path: &Path, suffix: &str) -> PathBuf {
let mut extension_with_suffix = original_path.extension().unwrap_or_default().to_os_string();
extension_with_suffix.push(suffix);
original_path.with_extension(extension_with_suffix)
}
fn storage_metadata_path(original_path: &Path) -> PathBuf {
path_with_suffix_extension(original_path, ".metadata")
path_with_suffix_extension(original_path, "metadata")
}
fn get_all_files<'a, P>(
@@ -359,29 +354,30 @@ async fn create_target_directory(target_file_path: &Path) -> anyhow::Result<()>
#[cfg(test)]
mod pure_tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use tempfile::tempdir;
use super::*;
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let workdir = tempdir()?.path().to_owned();
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
working_directory: workdir.clone(),
storage_root: storage_root.clone(),
};
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("file_name");
let expected_path = storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?);
let local_path = workdir
.join("timelines")
.join("some_timeline")
.join("file_name");
let expected_path = storage_root.join(local_path.strip_prefix(&workdir)?);
assert_eq!(
expected_path,
storage.storage_path(&local_path).expect("Matching path should map to storage path normally"),
"File paths from pageserver workdir should be stored in local fs storage with the same path they have relative to the workdir"
storage.remote_object_id(&local_path).expect("Matching path should map to storage path normally"),
"File paths from workdir should be stored in local fs storage with the same path they have relative to the workdir"
);
Ok(())
@@ -391,7 +387,7 @@ mod pure_tests {
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &LocalFs, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
match storage.remote_object_id(mismatching_path) {
Ok(wrong_path) => panic!(
"Expected path '{}' to error, but got storage path: {:?}",
mismatching_path.display(),
@@ -401,16 +397,16 @@ mod pure_tests {
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let workdir = tempdir()?.path().to_owned();
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
working_directory: workdir.clone(),
storage_root,
};
let error_string = storage_path_error(&storage, &repo_harness.conf.workdir);
let error_string = storage_path_error(&storage, &workdir);
assert!(error_string.contains("does not belong to this storage"));
assert!(error_string.contains(repo_harness.conf.workdir.to_str().unwrap()));
assert!(error_string.contains(workdir.to_str().unwrap()));
let mismatching_path_str = "/something/else";
let error_message = storage_path_error(&storage, Path::new(mismatching_path_str));
@@ -419,7 +415,7 @@ mod pure_tests {
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
error_message.contains(workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(error_message.contains("does not belong to this storage"));
@@ -429,29 +425,28 @@ mod pure_tests {
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let workdir = tempdir()?.path().to_owned();
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
working_directory: workdir.clone(),
storage_root: storage_root.clone(),
};
let name = "not a metadata";
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join(name);
let local_path = workdir.join("timelines").join("some_timeline").join(name);
assert_eq!(
local_path,
storage
.local_path(
&storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?)
)
.local_path(&storage_root.join(local_path.strip_prefix(&workdir)?))
.expect("For a valid input, valid local path should be parsed"),
"Should be able to parse metadata out of the correctly named remote delta file"
);
let local_metadata_path = repo_harness
.timeline_path(&TIMELINE_ID)
.join(METADATA_FILE_NAME);
let remote_metadata_path = storage.storage_path(&local_metadata_path)?;
let local_metadata_path = workdir
.join("timelines")
.join("some_timeline")
.join("metadata");
let remote_metadata_path = storage.remote_object_id(&local_metadata_path)?;
assert_eq!(
local_metadata_path,
storage
@@ -477,11 +472,10 @@ mod pure_tests {
}
}
let repo_harness = RepoHarness::create("local_path_negatives")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
working_directory: tempdir()?.path().to_owned(),
storage_root,
};
let totally_wrong_path = "wrong_wrong_wrong";
@@ -493,16 +487,19 @@ mod pure_tests {
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let workdir = tempdir()?.path().to_owned();
let original_path = workdir
.join("timelines")
.join("some_timeline")
.join("some name");
let storage_root = PathBuf::from("somewhere").join("else");
let dummy_storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
working_directory: workdir,
storage_root,
};
let storage_path = dummy_storage.storage_path(&original_path)?;
let storage_path = dummy_storage.remote_object_id(&original_path)?;
let download_destination = dummy_storage.local_path(&storage_path)?;
assert_eq!(
@@ -517,18 +514,17 @@ mod pure_tests {
#[cfg(test)]
mod fs_tests {
use super::*;
use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID};
use std::{collections::HashMap, io::Write};
use tempfile::tempdir;
#[tokio::test]
async fn upload_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("upload_file")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let (file, size) = create_file_for_upload(
&storage.pageserver_workdir.join("whatever"),
&storage.working_directory.join("whatever"),
"whatever_contents",
)
.await?;
@@ -543,14 +539,14 @@ mod fs_tests {
}
assert!(storage.list().await?.is_empty());
let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1", None).await?;
let target_path_1 = upload_dummy_file(&workdir, &storage, "upload_1", None).await?;
assert_eq!(
storage.list().await?,
vec![target_path_1.clone()],
"Should list a single file after first upload"
);
let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2", None).await?;
let target_path_2 = upload_dummy_file(&workdir, &storage, "upload_2", None).await?;
assert_eq!(
list_files_sorted(&storage).await?,
vec![target_path_1.clone(), target_path_2.clone()],
@@ -561,17 +557,16 @@ mod fs_tests {
}
fn create_storage() -> anyhow::Result<LocalFs> {
let pageserver_workdir = Box::leak(Box::new(tempdir()?.path().to_owned()));
let storage = LocalFs::new(tempdir()?.path().to_owned(), pageserver_workdir)?;
Ok(storage)
LocalFs::new(tempdir()?.path().to_owned(), tempdir()?.path().to_owned())
}
#[tokio::test]
async fn download_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let metadata = storage.download(&upload_target, &mut content_bytes).await?;
@@ -602,14 +597,15 @@ mod fs_tests {
#[tokio::test]
async fn download_file_range_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_positive")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?;
let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let metadata = storage
.download_range(&upload_target, 0, None, &mut full_range_bytes)
.download_byte_range(&upload_target, 0, None, &mut full_range_bytes)
.await?;
assert!(
metadata.is_none(),
@@ -625,7 +621,7 @@ mod fs_tests {
let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let same_byte = 1_000_000_000;
let metadata = storage
.download_range(
.download_byte_range(
&upload_target,
same_byte,
Some(same_byte + 1), // exclusive end
@@ -647,7 +643,7 @@ mod fs_tests {
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let metadata = storage
.download_range(
.download_byte_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
@@ -669,7 +665,7 @@ mod fs_tests {
let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let metadata = storage
.download_range(
.download_byte_range(
&upload_target,
first_part_local.len() as u64,
Some((first_part_local.len() + second_part_local.len()) as u64),
@@ -694,16 +690,17 @@ mod fs_tests {
#[tokio::test]
async fn download_file_range_negative() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_negative")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?;
let start = 10000;
let end = 234;
assert!(start > end, "Should test an incorrect range");
match storage
.download_range(&upload_target, start, Some(end), &mut io::sink())
.download_byte_range(&upload_target, start, Some(end), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading wrong ranges"),
@@ -717,7 +714,7 @@ mod fs_tests {
let non_existing_path = PathBuf::from("somewhere").join("else");
match storage
.download_range(&non_existing_path, 1, Some(3), &mut io::sink())
.download_byte_range(&non_existing_path, 1, Some(3), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading non-existing storage file ranges"),
@@ -732,10 +729,11 @@ mod fs_tests {
#[tokio::test]
async fn delete_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("delete_file")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?;
storage.delete(&upload_target).await?;
assert!(storage.list().await?.is_empty());
@@ -753,7 +751,8 @@ mod fs_tests {
#[tokio::test]
async fn file_with_metadata() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file")?;
let workdir = tempdir()?.path().to_owned();
let storage = create_storage()?;
let upload_name = "upload_1";
let metadata = StorageMetadata(HashMap::from([
@@ -761,7 +760,7 @@ mod fs_tests {
("two".to_string(), "2".to_string()),
]));
let upload_target =
upload_dummy_file(&repo_harness, &storage, upload_name, Some(metadata.clone())).await?;
upload_dummy_file(&workdir, &storage, upload_name, Some(metadata.clone())).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let full_download_metadata = storage.download(&upload_target, &mut content_bytes).await?;
@@ -785,7 +784,7 @@ mod fs_tests {
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let partial_download_metadata = storage
.download_range(
.download_byte_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
@@ -810,16 +809,16 @@ mod fs_tests {
}
async fn upload_dummy_file(
harness: &RepoHarness<'_>,
workdir: &Path,
storage: &LocalFs,
name: &str,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<PathBuf> {
let timeline_path = harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?;
let storage_path = storage.root.join(relative_timeline_path).join(name);
let timeline_path = workdir.join("timelines").join("some_timeline");
let relative_timeline_path = timeline_path.strip_prefix(&workdir)?;
let storage_path = storage.storage_root.join(relative_timeline_path).join(name);
let from_path = storage.pageserver_workdir.join(name);
let from_path = storage.working_directory.join(name);
let (file, size) = create_file_for_upload(&from_path, &dummy_contents(name)).await?;
storage.upload(file, size, &storage_path, metadata).await?;
Ok(storage_path)

View File

@@ -1,7 +1,7 @@
//! AWS S3 storage wrapper around `rusoto` library.
//!
//! Respects `prefix_in_bucket` property from [`S3Config`],
//! allowing multiple pageservers to independently work with the same S3 bucket, if
//! allowing multiple api users to independently work with the same S3 bucket, if
//! their bucket prefixes are both specified and different.
use std::path::{Path, PathBuf};
@@ -19,16 +19,13 @@ use tokio::{io, sync::Semaphore};
use tokio_util::io::ReaderStream;
use tracing::debug;
use crate::{
config::S3Config,
remote_storage::{strip_path_prefix, RemoteStorage},
};
use crate::{strip_path_prefix, RemoteStorage, S3Config};
use super::StorageMetadata;
const S3_FILE_SEPARATOR: char = '/';
const S3_PREFIX_SEPARATOR: char = '/';
#[derive(Debug, Eq, PartialEq)]
#[derive(Debug, Eq, PartialEq, PartialOrd, Ord, Hash)]
pub struct S3ObjectKey(String);
impl S3ObjectKey {
@@ -36,11 +33,7 @@ impl S3ObjectKey {
&self.0
}
fn download_destination(
&self,
pageserver_workdir: &Path,
prefix_to_strip: Option<&str>,
) -> PathBuf {
fn download_destination(&self, workdir: &Path, prefix_to_strip: Option<&str>) -> PathBuf {
let path_without_prefix = match prefix_to_strip {
Some(prefix) => self.0.strip_prefix(prefix).unwrap_or_else(|| {
panic!(
@@ -51,9 +44,9 @@ impl S3ObjectKey {
None => &self.0,
};
pageserver_workdir.join(
workdir.join(
path_without_prefix
.split(S3_FILE_SEPARATOR)
.split(S3_PREFIX_SEPARATOR)
.collect::<PathBuf>(),
)
}
@@ -61,7 +54,7 @@ impl S3ObjectKey {
/// AWS S3 storage.
pub struct S3Bucket {
pageserver_workdir: &'static Path,
workdir: PathBuf,
client: S3Client,
bucket_name: String,
prefix_in_bucket: Option<String>,
@@ -73,7 +66,7 @@ pub struct S3Bucket {
impl S3Bucket {
/// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
pub fn new(aws_config: &S3Config, workdir: PathBuf) -> anyhow::Result<Self> {
debug!(
"Creating s3 remote storage for S3 bucket {}",
aws_config.bucket_name
@@ -89,8 +82,11 @@ impl S3Bucket {
.context("Failed to parse the s3 region from config")?,
};
let request_dispatcher = HttpClient::new().context("Failed to create S3 http client")?;
let client = if aws_config.access_key_id.is_none() && aws_config.secret_access_key.is_none()
{
let access_key_id = std::env::var("AWS_ACCESS_KEY_ID").ok();
let secret_access_key = std::env::var("AWS_SECRET_ACCESS_KEY").ok();
let client = if access_key_id.is_none() && secret_access_key.is_none() {
debug!("Using IAM-based AWS access");
S3Client::new_with(request_dispatcher, InstanceMetadataProvider::new(), region)
} else {
@@ -98,8 +94,8 @@ impl S3Bucket {
S3Client::new_with(
request_dispatcher,
StaticProvider::new_minimal(
aws_config.access_key_id.clone().unwrap_or_default(),
aws_config.secret_access_key.clone().unwrap_or_default(),
access_key_id.unwrap_or_default(),
secret_access_key.unwrap_or_default(),
),
region,
)
@@ -107,12 +103,12 @@ impl S3Bucket {
let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| {
let mut prefix = prefix;
while prefix.starts_with(S3_FILE_SEPARATOR) {
while prefix.starts_with(S3_PREFIX_SEPARATOR) {
prefix = &prefix[1..]
}
let mut prefix = prefix.to_string();
while prefix.ends_with(S3_FILE_SEPARATOR) {
while prefix.ends_with(S3_PREFIX_SEPARATOR) {
prefix.pop();
}
prefix
@@ -120,7 +116,7 @@ impl S3Bucket {
Ok(Self {
client,
pageserver_workdir,
workdir,
bucket_name: aws_config.bucket_name.clone(),
prefix_in_bucket,
concurrency_limiter: Semaphore::new(aws_config.concurrency_limit.get()),
@@ -130,24 +126,23 @@ impl S3Bucket {
#[async_trait::async_trait]
impl RemoteStorage for S3Bucket {
type StoragePath = S3ObjectKey;
type RemoteObjectId = S3ObjectKey;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
let relative_path = strip_path_prefix(self.pageserver_workdir, local_path)?;
fn remote_object_id(&self, local_path: &Path) -> anyhow::Result<Self::RemoteObjectId> {
let relative_path = strip_path_prefix(&self.workdir, local_path)?;
let mut key = self.prefix_in_bucket.clone().unwrap_or_default();
for segment in relative_path {
key.push(S3_FILE_SEPARATOR);
key.push(S3_PREFIX_SEPARATOR);
key.push_str(&segment.to_string_lossy());
}
Ok(S3ObjectKey(key))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
Ok(storage_path
.download_destination(self.pageserver_workdir, self.prefix_in_bucket.as_deref()))
fn local_path(&self, storage_path: &Self::RemoteObjectId) -> anyhow::Result<PathBuf> {
Ok(storage_path.download_destination(&self.workdir, self.prefix_in_bucket.as_deref()))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
async fn list(&self) -> anyhow::Result<Vec<Self::RemoteObjectId>> {
let mut document_keys = Vec::new();
let mut continuation_token = None;
@@ -187,7 +182,7 @@ impl RemoteStorage for S3Bucket {
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from_size_bytes: usize,
to: &Self::StoragePath,
to: &Self::RemoteObjectId,
metadata: Option<StorageMetadata>,
) -> anyhow::Result<()> {
let _guard = self
@@ -212,7 +207,7 @@ impl RemoteStorage for S3Bucket {
async fn download(
&self,
from: &Self::StoragePath,
from: &Self::RemoteObjectId,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<Option<StorageMetadata>> {
let _guard = self
@@ -237,9 +232,9 @@ impl RemoteStorage for S3Bucket {
Ok(object_output.metadata.map(StorageMetadata))
}
async fn download_range(
async fn download_byte_range(
&self,
from: &Self::StoragePath,
from: &Self::RemoteObjectId,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
@@ -274,7 +269,7 @@ impl RemoteStorage for S3Bucket {
Ok(object_output.metadata.map(StorageMetadata))
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()> {
let _guard = self
.concurrency_limiter
.acquire()
@@ -293,34 +288,30 @@ impl RemoteStorage for S3Bucket {
#[cfg(test)]
mod tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use tempfile::tempdir;
use super::*;
#[test]
fn download_destination() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination")?;
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("test_name");
let relative_path = local_path.strip_prefix(&repo_harness.conf.workdir)?;
let workdir = tempdir()?.path().to_owned();
let local_path = workdir.join("one").join("two").join("test_name");
let relative_path = local_path.strip_prefix(&workdir)?;
let key = S3ObjectKey(format!(
"{}{}",
S3_FILE_SEPARATOR,
S3_PREFIX_SEPARATOR,
relative_path
.iter()
.map(|segment| segment.to_str().unwrap())
.collect::<Vec<_>>()
.join(&S3_FILE_SEPARATOR.to_string()),
.join(&S3_PREFIX_SEPARATOR.to_string()),
));
assert_eq!(
local_path,
key.download_destination(&repo_harness.conf.workdir, None),
"Download destination should consist of s3 path joined with the pageserver workdir prefix"
key.download_destination(&workdir, None),
"Download destination should consist of s3 path joined with the workdir prefix"
);
Ok(())
@@ -328,24 +319,21 @@ mod tests {
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let workdir = tempdir()?.path().to_owned();
let segment_1 = "matching";
let segment_2 = "file";
let local_path = &repo_harness.conf.workdir.join(segment_1).join(segment_2);
let local_path = &workdir.join(segment_1).join(segment_2);
let storage = dummy_storage(&repo_harness.conf.workdir);
let storage = dummy_storage(workdir);
let expected_key = S3ObjectKey(format!(
"{}{SEPARATOR}{}{SEPARATOR}{}",
"{}{S3_PREFIX_SEPARATOR}{segment_1}{S3_PREFIX_SEPARATOR}{segment_2}",
storage.prefix_in_bucket.as_deref().unwrap_or_default(),
segment_1,
segment_2,
SEPARATOR = S3_FILE_SEPARATOR,
));
let actual_key = storage
.storage_path(local_path)
.remote_object_id(local_path)
.expect("Matching path should map to S3 path normally");
assert_eq!(
expected_key,
@@ -360,7 +348,7 @@ mod tests {
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &S3Bucket, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
match storage.remote_object_id(mismatching_path) {
Ok(wrong_key) => panic!(
"Expected path '{}' to error, but got S3 key: {:?}",
mismatching_path.display(),
@@ -370,10 +358,10 @@ mod tests {
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let workdir = tempdir()?.path().to_owned();
let storage = dummy_storage(workdir.clone());
let error_message = storage_path_error(&storage, &repo_harness.conf.workdir);
let error_message = storage_path_error(&storage, &workdir);
assert!(
error_message.contains("Prefix and the path are equal"),
"Message '{}' does not contain the required string",
@@ -387,7 +375,7 @@ mod tests {
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
error_message.contains(workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(
@@ -401,20 +389,17 @@ mod tests {
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_dir.strip_prefix(&repo_harness.conf.workdir)?;
let workdir = tempdir()?.path().to_owned();
let storage = dummy_storage(workdir.clone());
let timeline_dir = workdir.join("timelines").join("test_timeline");
let relative_timeline_path = timeline_dir.strip_prefix(&workdir)?;
let s3_key = create_s3_key(
&relative_timeline_path.join("not a metadata"),
storage.prefix_in_bucket.as_deref(),
);
assert_eq!(
s3_key.download_destination(
&repo_harness.conf.workdir,
storage.prefix_in_bucket.as_deref()
),
s3_key.download_destination(&workdir, storage.prefix_in_bucket.as_deref()),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
@@ -422,14 +407,11 @@ mod tests {
);
let s3_key = create_s3_key(
&relative_timeline_path.join(METADATA_FILE_NAME),
&relative_timeline_path.join("metadata"),
storage.prefix_in_bucket.as_deref(),
);
assert_eq!(
s3_key.download_destination(
&repo_harness.conf.workdir,
storage.prefix_in_bucket.as_deref()
),
s3_key.download_destination(&workdir, storage.prefix_in_bucket.as_deref()),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
@@ -441,12 +423,15 @@ mod tests {
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let workdir = tempdir()?.path().to_owned();
let original_path = workdir
.join("timelines")
.join("some_timeline")
.join("some name");
let dummy_storage = dummy_storage(&repo_harness.conf.workdir);
let dummy_storage = dummy_storage(workdir);
let key = dummy_storage.storage_path(&original_path)?;
let key = dummy_storage.remote_object_id(&original_path)?;
let download_destination = dummy_storage.local_path(&key)?;
assert_eq!(
@@ -457,9 +442,9 @@ mod tests {
Ok(())
}
fn dummy_storage(pageserver_workdir: &'static Path) -> S3Bucket {
fn dummy_storage(workdir: PathBuf) -> S3Bucket {
S3Bucket {
pageserver_workdir,
workdir,
client: S3Client::new("us-east-1".parse().unwrap()),
bucket_name: "dummy-bucket".to_string(),
prefix_in_bucket: Some("dummy_prefix/".to_string()),
@@ -471,7 +456,7 @@ mod tests {
S3ObjectKey(relative_file_path.iter().fold(
prefix.unwrap_or_default().to_string(),
|mut path_string, segment| {
path_string.push(S3_FILE_SEPARATOR);
path_string.push(S3_PREFIX_SEPARATOR);
path_string.push_str(segment.to_str().unwrap());
path_string
},

View File

@@ -1,3 +0,0 @@
fn main() {
println!("cargo:rerun-if-env-changed=GIT_VERSION");
}

View File

@@ -5,7 +5,7 @@ use anyhow::anyhow;
use hyper::header::AUTHORIZATION;
use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server};
use lazy_static::lazy_static;
use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder};
use metrics::{register_int_counter, Encoder, IntCounter, TextEncoder};
use routerify::ext::RequestExt;
use routerify::RequestInfo;
use routerify::{Middleware, Router, RouterBuilder, RouterService};
@@ -18,7 +18,7 @@ use super::error::ApiError;
lazy_static! {
static ref SERVE_METRICS_COUNT: IntCounter = register_int_counter!(
new_common_metric_name("serve_metrics_count"),
"libmetrics_metric_handler_requests_total",
"Number of metric requests made"
)
.expect("failed to define a metric");

View File

@@ -1,7 +1,7 @@
use std::str::FromStr;
use super::error::ApiError;
use hyper::{Body, Request};
use hyper::{body::HttpBody, Body, Request};
use routerify::ext::RequestExt;
pub fn get_request_param<'a>(
@@ -31,3 +31,10 @@ pub fn parse_request_param<T: FromStr>(
))),
}
}
pub async fn ensure_no_body(request: &mut Request<Body>) -> Result<(), ApiError> {
match request.body_mut().data().await {
Some(_) => Err(ApiError::BadRequest("Unexpected request body".into())),
None => Ok(()),
}
}

View File

@@ -54,31 +54,52 @@ pub mod nonblock;
// Default signal handling
pub mod signals;
// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
//
// we have several cases:
// * building locally from git repo
// * building in CI from git repo
// * building in docker (either in CI or locally)
//
// One thing to note is that .git is not available in docker (and it is bad to include it there).
// So everything becides docker build is covered by git_version crate.
// For docker use environment variable to pass git version, which is then retrieved by buildscript (build.rs).
// It takes variable from build process env and puts it to the rustc env. And then we can retrieve it here by using env! macro.
// Git version received from environment variable used as a fallback in git_version invokation.
// And to avoid running buildscript every recompilation, we use rerun-if-env-changed option.
// So the build script will be run only when GIT_VERSION envvar has changed.
//
// Why not to use buildscript to get git commit sha directly without procmacro from different crate?
// Caching and workspaces complicates that. In case `utils` is not
// recompiled due to caching then version may become outdated.
// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro,
// so if we changed the index state git_version will pick that up and rerun the macro.
//
// Note that with git_version prefix is `git:` and in case of git version from env its `git-env:`.
use git_version::git_version;
pub const GIT_VERSION: &str = git_version!(
prefix = "git:",
fallback = concat!("git-env:", env!("GIT_VERSION")),
args = ["--abbrev=40", "--always", "--dirty=-modified"] // always use full sha
);
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:
/// * building locally from git repo
/// * building in CI from git repo
/// * building in docker (either in CI or locally)
///
/// One thing to note is that .git is not available in docker (and it is bad to include it there).
/// So everything becides docker build is covered by git_version crate, and docker uses a `GIT_VERSION` argument to get the value required.
/// It takes variable from build process env and puts it to the rustc env. And then we can retrieve it here by using env! macro.
/// Git version received from environment variable used as a fallback in git_version invokation.
/// And to avoid running buildscript every recompilation, we use rerun-if-env-changed option.
/// So the build script will be run only when GIT_VERSION envvar has changed.
///
/// Why not to use buildscript to get git commit sha directly without procmacro from different crate?
/// Caching and workspaces complicates that. In case `utils` is not
/// recompiled due to caching then version may become outdated.
/// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro,
/// so if we changed the index state git_version will pick that up and rerun the macro.
///
/// Note that with git_version prefix is `git:` and in case of git version from env its `git-env:`.
///
/// #############################################################################################
/// TODO this macro is not the way the library is intended to be used, see https://github.com/neondatabase/neon/issues/1565 for details.
/// We use `cachepot` to reduce our current CI build times: https://github.com/neondatabase/cloud/pull/1033#issuecomment-1100935036
/// Yet, it seems to ignore the GIT_VERSION env variable, passed to Docker build, even with build.rs that contains
/// `println!("cargo:rerun-if-env-changed=GIT_VERSION");` code for cachepot cache invalidation.
/// The problem needs further investigation and regular `const` declaration instead of a macro.
#[macro_export]
macro_rules! project_git_version {
($const_identifier:ident) => {
const $const_identifier: &str = git_version::git_version!(
prefix = "git:",
fallback = concat!(
"git-env:",
env!("GIT_VERSION", "Missing GIT_VERSION envvar")
),
args = ["--abbrev=40", "--always", "--dirty=-modified"] // always use full sha
);
};
}
/// Same as `assert!`, but evaluated during compilation and gets optimized out in runtime.
#[macro_export]
macro_rules! const_assert {
($($args:tt)*) => {
const _: () = assert!($($args)*);
};
}

View File

@@ -433,7 +433,12 @@ impl PostgresBackend {
// full cause of the error, not just the top-level context + its trace.
// We don't want to send that in the ErrorResponse though,
// because it's not relevant to the compute node logs.
error!("query handler for '{}' failed: {:?}", query_string, e);
if query_string.starts_with("callmemaybe") {
// FIXME avoid printing a backtrace for tenant x not found errors until this is properly fixed
error!("query handler for '{}' failed: {}", query_string, e);
} else {
error!("query handler for '{}' failed: {:?}", query_string, e);
}
self.write_message_noflush(&BeMessage::ErrorResponse(&e.to_string()))?;
// TODO: untangle convoluted control flow
if e.to_string().contains("failed to run") {

View File

@@ -503,6 +503,18 @@ impl RowDescriptor<'_> {
formatcode: 0,
}
}
pub const fn text_col(name: &[u8]) -> RowDescriptor {
RowDescriptor {
name,
tableoid: 0,
attnum: 0,
typoid: TEXT_OID,
typlen: -1,
typmod: 0,
formatcode: 0,
}
}
}
#[derive(Debug)]

View File

@@ -1,11 +1,9 @@
use std::{alloc::Layout, cmp::Ordering, ops::RangeBounds};
use serde::{Deserialize, Serialize};
/// Ordered map datastructure implemented in a Vec.
/// Append only - can only add keys that are larger than the
/// current max key.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[derive(Clone, Debug)]
pub struct VecMap<K, V>(Vec<(K, V)>);
impl<K, V> Default for VecMap<K, V> {

View File

@@ -224,7 +224,7 @@ impl fmt::Display for ZTenantTimelineId {
// Unique ID of a storage node (safekeeper or pageserver). Supposed to be issued
// by the console.
#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Debug, Serialize, Deserialize)]
#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash, Debug, Serialize, Deserialize)]
#[serde(transparent)]
pub struct ZNodeId(pub u64);

View File

@@ -1,25 +0,0 @@
version: "3"
services:
prometheus:
container_name: prometheus
image: prom/prometheus:latest
volumes:
- ./prometheus.yaml:/etc/prometheus/prometheus.yml
# ports:
# - "9090:9090"
# TODO: find a proper portable solution
network_mode: "host"
grafana:
image: grafana/grafana:latest
volumes:
- ./grafana.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
- GF_AUTH_DISABLE_LOGIN_FORM=true
# ports:
# - "3000:3000"
# TODO: find a proper portable solution
network_mode: "host"

View File

@@ -1,12 +0,0 @@
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
orgId: 1
url: http://localhost:9090
basicAuth: false
isDefault: false
version: 1
editable: false

View File

@@ -1,5 +0,0 @@
scrape_configs:
- job_name: 'default'
scrape_interval: 10s
static_configs:
- targets: ['localhost:9898']

View File

@@ -1,5 +1,5 @@
[package]
name = "zenith"
name = "neon_local"
version = "0.1.0"
edition = "2021"
@@ -7,7 +7,9 @@ edition = "2021"
clap = "3.0"
anyhow = "1.0"
serde_json = "1"
comfy-table = "5.0.1"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
git-version = "0.3.5"
# FIXME: 'pageserver' is needed for BranchInfo. Refactor
pageserver = { path = "../pageserver" }

View File

@@ -1,10 +1,10 @@
use anyhow::{anyhow, bail, Context, Result};
use clap::{Arg, ArgMatches, Command};
use clap::{App, AppSettings, Arg, ArgMatches};
use control_plane::compute::ComputeControlPlane;
use control_plane::local_env;
use control_plane::local_env::LocalEnv;
use control_plane::local_env::{EtcdBroker, LocalEnv};
use control_plane::safekeeper::SafekeeperNode;
use control_plane::storage::PageServerNode;
use control_plane::{etcd, local_env};
use pageserver::config::defaults::{
DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR,
DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR,
@@ -14,14 +14,15 @@ use safekeeper::defaults::{
DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT,
};
use std::collections::{BTreeSet, HashMap};
use std::path::{Path, PathBuf};
use std::process::exit;
use std::str::FromStr;
use utils::{
auth::{Claims, Scope},
lsn::Lsn,
postgres_backend::AuthType,
project_git_version,
zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId},
GIT_VERSION,
};
use pageserver::timelines::TimelineInfo;
@@ -30,29 +31,29 @@ use pageserver::timelines::TimelineInfo;
const DEFAULT_SAFEKEEPER_ID: ZNodeId = ZNodeId(1);
const DEFAULT_PAGESERVER_ID: ZNodeId = ZNodeId(1);
const DEFAULT_BRANCH_NAME: &str = "main";
project_git_version!(GIT_VERSION);
fn default_conf() -> String {
fn default_conf(etcd_binary_path: &Path) -> String {
format!(
r#"
# Default built-in configuration, defined in main.rs
[etcd_broker]
broker_endpoints = ['http://localhost:2379']
etcd_binary_path = '{etcd_binary_path}'
[pageserver]
id = {pageserver_id}
listen_pg_addr = '{pageserver_pg_addr}'
listen_http_addr = '{pageserver_http_addr}'
id = {DEFAULT_PAGESERVER_ID}
listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
auth_type = '{pageserver_auth_type}'
[[safekeepers]]
id = {safekeeper_id}
pg_port = {safekeeper_pg_port}
http_port = {safekeeper_http_port}
id = {DEFAULT_SAFEKEEPER_ID}
pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
"#,
pageserver_id = DEFAULT_PAGESERVER_ID,
pageserver_pg_addr = DEFAULT_PAGESERVER_PG_ADDR,
pageserver_http_addr = DEFAULT_PAGESERVER_HTTP_ADDR,
etcd_binary_path = etcd_binary_path.display(),
pageserver_auth_type = AuthType::Trust,
safekeeper_id = DEFAULT_SAFEKEEPER_ID,
safekeeper_pg_port = DEFAULT_SAFEKEEPER_PG_PORT,
safekeeper_http_port = DEFAULT_SAFEKEEPER_HTTP_PORT,
)
}
@@ -62,15 +63,15 @@ http_port = {safekeeper_http_port}
struct TimelineTreeEl {
/// `TimelineInfo` received from the `pageserver` via the `timeline_list` http API call.
pub info: TimelineInfo,
/// Name, recovered from zenith config mappings
/// Name, recovered from neon config mappings
pub name: Option<String>,
/// Holds all direct children of this timeline referenced using `timeline_id`.
pub children: BTreeSet<ZTimelineId>,
}
// Main entry point for the 'zenith' CLI utility
// Main entry point for the 'neon_local' CLI utility
//
// This utility helps to manage zenith installation. That includes following:
// This utility helps to manage neon installation. That includes following:
// * Management of local postgres installations running on top of the
// pageserver.
// * Providing CLI api to the pageserver
@@ -125,12 +126,12 @@ fn main() -> Result<()> {
.takes_value(true)
.required(false);
let matches = Command::new("Zenith CLI")
.arg_required_else_help(true)
let matches = App::new("Neon CLI")
.setting(AppSettings::ArgRequiredElseHelp)
.version(GIT_VERSION)
.subcommand(
Command::new("init")
.about("Initialize a new Zenith repository")
App::new("init")
.about("Initialize a new Neon repository")
.arg(pageserver_config_args.clone())
.arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline"))
.arg(
@@ -141,12 +142,12 @@ fn main() -> Result<()> {
)
)
.subcommand(
Command::new("timeline")
App::new("timeline")
.about("Manage timelines")
.subcommand(Command::new("list")
.subcommand(App::new("list")
.about("List all timelines, available to this pageserver")
.arg(tenant_id_arg.clone()))
.subcommand(Command::new("branch")
.subcommand(App::new("branch")
.about("Create a new timeline, using another timeline as a base, copying its data")
.arg(tenant_id_arg.clone())
.arg(branch_name_arg.clone())
@@ -154,60 +155,74 @@ fn main() -> Result<()> {
.help("Use last Lsn of another timeline (and its data) as base when creating the new timeline. The timeline gets resolved by its branch name.").required(false))
.arg(Arg::new("ancestor-start-lsn").long("ancestor-start-lsn").takes_value(true)
.help("When using another timeline as base, use a specific Lsn in it instead of the latest one").required(false)))
.subcommand(Command::new("create")
.subcommand(App::new("create")
.about("Create a new blank timeline")
.arg(tenant_id_arg.clone())
.arg(branch_name_arg.clone()))
.subcommand(App::new("import")
.about("Import timeline from basebackup directory")
.arg(tenant_id_arg.clone())
.arg(timeline_id_arg.clone())
.arg(Arg::new("node-name").long("node-name").takes_value(true)
.help("Name to assign to the imported timeline"))
.arg(Arg::new("base-tarfile").long("base-tarfile").takes_value(true)
.help("Basebackup tarfile to import"))
.arg(Arg::new("base-lsn").long("base-lsn").takes_value(true)
.help("Lsn the basebackup starts at"))
.arg(Arg::new("wal-tarfile").long("wal-tarfile").takes_value(true)
.help("Wal to add after base"))
.arg(Arg::new("end-lsn").long("end-lsn").takes_value(true)
.help("Lsn the basebackup ends at")))
).subcommand(
Command::new("tenant")
.arg_required_else_help(true)
App::new("tenant")
.setting(AppSettings::ArgRequiredElseHelp)
.about("Manage tenants")
.subcommand(Command::new("list"))
.subcommand(Command::new("create")
.subcommand(App::new("list"))
.subcommand(App::new("create")
.arg(tenant_id_arg.clone())
.arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline"))
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.subcommand(Command::new("config")
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.subcommand(App::new("config")
.arg(tenant_id_arg.clone())
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
)
.subcommand(
Command::new("pageserver")
.arg_required_else_help(true)
App::new("pageserver")
.setting(AppSettings::ArgRequiredElseHelp)
.about("Manage pageserver")
.subcommand(Command::new("status"))
.subcommand(Command::new("start").about("Start local pageserver").arg(pageserver_config_args.clone()))
.subcommand(Command::new("stop").about("Stop local pageserver")
.subcommand(App::new("status"))
.subcommand(App::new("start").about("Start local pageserver").arg(pageserver_config_args.clone()))
.subcommand(App::new("stop").about("Stop local pageserver")
.arg(stop_mode_arg.clone()))
.subcommand(Command::new("restart").about("Restart local pageserver").arg(pageserver_config_args.clone()))
.subcommand(App::new("restart").about("Restart local pageserver").arg(pageserver_config_args.clone()))
)
.subcommand(
Command::new("safekeeper")
.arg_required_else_help(true)
App::new("safekeeper")
.setting(AppSettings::ArgRequiredElseHelp)
.about("Manage safekeepers")
.subcommand(Command::new("start")
.subcommand(App::new("start")
.about("Start local safekeeper")
.arg(safekeeper_id_arg.clone())
)
.subcommand(Command::new("stop")
.subcommand(App::new("stop")
.about("Stop local safekeeper")
.arg(safekeeper_id_arg.clone())
.arg(stop_mode_arg.clone())
)
.subcommand(Command::new("restart")
.subcommand(App::new("restart")
.about("Restart local safekeeper")
.arg(safekeeper_id_arg.clone())
.arg(stop_mode_arg.clone())
)
)
.subcommand(
Command::new("pg")
.arg_required_else_help(true)
App::new("pg")
.setting(AppSettings::ArgRequiredElseHelp)
.about("Manage postgres instances")
.subcommand(Command::new("list").arg(tenant_id_arg.clone()))
.subcommand(Command::new("create")
.subcommand(App::new("list").arg(tenant_id_arg.clone()))
.subcommand(App::new("create")
.about("Create a postgres compute node")
.arg(pg_node_arg.clone())
.arg(branch_name_arg.clone())
@@ -220,7 +235,7 @@ fn main() -> Result<()> {
.long("config-only")
.required(false)
))
.subcommand(Command::new("start")
.subcommand(App::new("start")
.about("Start a postgres compute node.\n This command actually creates new node from scratch, but preserves existing config files")
.arg(pg_node_arg.clone())
.arg(tenant_id_arg.clone())
@@ -229,7 +244,7 @@ fn main() -> Result<()> {
.arg(lsn_arg.clone())
.arg(port_arg.clone()))
.subcommand(
Command::new("stop")
App::new("stop")
.arg(pg_node_arg.clone())
.arg(tenant_id_arg.clone())
.arg(
@@ -242,12 +257,12 @@ fn main() -> Result<()> {
)
.subcommand(
Command::new("start")
App::new("start")
.about("Start page server and safekeepers")
.arg(pageserver_config_args)
)
.subcommand(
Command::new("stop")
App::new("stop")
.about("Stop page server and safekeepers")
.arg(stop_mode_arg.clone())
)
@@ -258,7 +273,7 @@ fn main() -> Result<()> {
None => bail!("no subcommand provided"),
};
// Check for 'zenith init' command first.
// Check for 'neon init' command first.
let subcommand_result = if sub_name == "init" {
handle_init(sub_args).map(Some)
} else {
@@ -274,7 +289,7 @@ fn main() -> Result<()> {
"pageserver" => handle_pageserver(sub_args, &env),
"pg" => handle_pg(sub_args, &env),
"safekeeper" => handle_safekeeper(sub_args, &env),
_ => bail!("unexpected subcommand {}", sub_name),
_ => bail!("unexpected subcommand {sub_name}"),
};
if original_env != env {
@@ -288,7 +303,7 @@ fn main() -> Result<()> {
Ok(Some(updated_env)) => updated_env.persist_config(&updated_env.base_data_dir)?,
Ok(None) => (),
Err(e) => {
eprintln!("command failed: {:?}", e);
eprintln!("command failed: {e:?}");
exit(1);
}
}
@@ -467,23 +482,22 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result<Option<ZTimelineI
.context("Failed to parse timeline id from the argument string")
}
fn handle_init(init_match: &ArgMatches) -> Result<LocalEnv> {
fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
let initial_timeline_id_arg = parse_timeline_id(init_match)?;
// Create config file
let toml_file: String = if let Some(config_path) = init_match.value_of("config") {
// load and parse the file
std::fs::read_to_string(std::path::Path::new(config_path))
.with_context(|| format!("Could not read configuration file \"{}\"", config_path))?
.with_context(|| format!("Could not read configuration file '{config_path}'"))?
} else {
// Built-in default config
default_conf()
default_conf(&EtcdBroker::locate_etcd()?)
};
let mut env =
LocalEnv::create_config(&toml_file).context("Failed to create zenith configuration")?;
env.init()
.context("Failed to initialize zenith repository")?;
LocalEnv::parse_config(&toml_file).context("Failed to create neon configuration")?;
env.init().context("Failed to initialize neon repository")?;
// default_tenantid was generated by the `env.init()` call above
let initial_tenant_id = env.default_tenant_id.unwrap();
@@ -497,7 +511,7 @@ fn handle_init(init_match: &ArgMatches) -> Result<LocalEnv> {
&pageserver_config_overrides(init_match),
)
.unwrap_or_else(|e| {
eprintln!("pageserver init failed: {}", e);
eprintln!("pageserver init failed: {e}");
exit(1);
});
@@ -518,7 +532,7 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
.collect()
}
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> {
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> {
let pageserver = PageServerNode::from_env(env);
match tenant_match.subcommand() {
Some(("list", _)) => {
@@ -541,6 +555,29 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Re
"tenant {} successfully created on the pageserver",
new_tenant_id
);
// Create an initial timeline for the new tenant
let new_timeline_id = parse_timeline_id(create_match)?;
let timeline = pageserver
.timeline_create(new_tenant_id, new_timeline_id, None, None)?
.context(format!(
"Failed to create initial timeline for tenant {new_tenant_id}"
))?;
let new_timeline_id = timeline.timeline_id;
let last_record_lsn = timeline
.local
.context(format!("Failed to get last record LSN: no local timeline info for timeline {new_timeline_id}"))?
.last_record_lsn;
env.register_branch_mapping(
DEFAULT_BRANCH_NAME.to_string(),
new_tenant_id,
new_timeline_id,
)?;
println!(
"Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {new_tenant_id}",
);
}
Some(("config", create_match)) => {
let tenant_id = get_tenant_id(create_match, env)?;
@@ -551,17 +588,8 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Re
pageserver
.tenant_config(tenant_id, tenant_conf)
.unwrap_or_else(|e| {
anyhow!(
"Tenant config failed for tenant with id {} : {}",
tenant_id,
e
);
});
println!(
"tenant {} successfully configured on the pageserver",
tenant_id
);
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
println!("tenant {tenant_id} successfully configured on the pageserver");
}
Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name),
None => bail!("no tenant subcommand provided"),
@@ -599,6 +627,43 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
timeline.timeline_id, last_record_lsn, tenant_id,
);
}
Some(("import", import_match)) => {
let tenant_id = get_tenant_id(import_match, env)?;
let timeline_id = parse_timeline_id(import_match)?.expect("No timeline id provided");
let name = import_match
.value_of("node-name")
.ok_or_else(|| anyhow!("No node name provided"))?;
// Parse base inputs
let base_tarfile = import_match
.value_of("base-tarfile")
.map(|s| PathBuf::from_str(s).unwrap())
.ok_or_else(|| anyhow!("No base-tarfile provided"))?;
let base_lsn = Lsn::from_str(
import_match
.value_of("base-lsn")
.ok_or_else(|| anyhow!("No base-lsn provided"))?,
)?;
let base = (base_lsn, base_tarfile);
// Parse pg_wal inputs
let wal_tarfile = import_match
.value_of("wal-tarfile")
.map(|s| PathBuf::from_str(s).unwrap());
let end_lsn = import_match
.value_of("end-lsn")
.map(|s| Lsn::from_str(s).unwrap());
// TODO validate both or none are provided
let pg_wal = end_lsn.zip(wal_tarfile);
let mut cplane = ComputeControlPlane::load(env.clone())?;
println!("Importing timeline into pageserver ...");
pageserver.timeline_import(tenant_id, timeline_id, base, pg_wal)?;
println!("Creating node for imported timeline ...");
env.register_branch_mapping(name.to_string(), tenant_id, timeline_id)?;
cplane.new_node(tenant_id, name, timeline_id, None, None)?;
println!("Done");
}
Some(("branch", branch_match)) => {
let tenant_id = get_tenant_id(branch_match, env)?;
let new_branch_name = branch_match
@@ -665,35 +730,56 @@ fn handle_pg(pg_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let timeline_name_mappings = env.timeline_name_mappings();
println!("NODE\tADDRESS\tTIMELINE\tBRANCH NAME\tLSN\t\tSTATUS");
let mut table = comfy_table::Table::new();
table.load_preset(comfy_table::presets::NOTHING);
table.set_header(&[
"NODE",
"ADDRESS",
"TIMELINE",
"BRANCH NAME",
"LSN",
"STATUS",
]);
for ((_, node_name), node) in cplane
.nodes
.iter()
.filter(|((node_tenant_id, _), _)| node_tenant_id == &tenant_id)
{
// FIXME: This shows the LSN at the end of the timeline. It's not the
// right thing to do for read-only nodes that might be anchored at an
// older point in time, or following but lagging behind the primary.
let lsn_str = timeline_infos
.get(&node.timeline_id)
.and_then(|bi| bi.local.as_ref().map(|l| l.last_record_lsn.to_string()))
.unwrap_or_else(|| "?".to_string());
let lsn_str = match node.lsn {
None => {
// -> primary node
// Use the LSN at the end of the timeline.
timeline_infos
.get(&node.timeline_id)
.and_then(|bi| bi.local.as_ref().map(|l| l.last_record_lsn.to_string()))
.unwrap_or_else(|| "?".to_string())
}
Some(lsn) => {
// -> read-only node
// Use the node's LSN.
lsn.to_string()
}
};
let branch_name = timeline_name_mappings
.get(&ZTenantTimelineId::new(tenant_id, node.timeline_id))
.map(|name| name.as_str())
.unwrap_or("?");
println!(
"{}\t{}\t{}\t{}\t{}\t{}",
node_name,
node.address,
node.timeline_id,
table.add_row(&[
node_name.as_str(),
&node.address.to_string(),
&node.timeline_id.to_string(),
branch_name,
lsn_str,
lsn_str.as_str(),
node.status(),
);
]);
}
println!("{table}");
}
"create" => {
let branch_name = sub_args
@@ -885,20 +971,23 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(())
}
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> {
etcd::start_etcd_process(env)?;
let pageserver = PageServerNode::from_env(env);
// Postgres nodes are not started automatically
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
eprintln!("pageserver start failed: {}", e);
eprintln!("pageserver start failed: {e}");
try_stop_etcd_process(env);
exit(1);
}
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start() {
eprintln!("safekeeper '{}' start failed: {}", safekeeper.id, e);
eprintln!("safekeeper '{}' start failed: {e}", safekeeper.id);
try_stop_etcd_process(env);
exit(1);
}
}
@@ -928,5 +1017,14 @@ fn handle_stop_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<
eprintln!("safekeeper '{}' stop failed: {}", safekeeper.id, e);
}
}
try_stop_etcd_process(env);
Ok(())
}
fn try_stop_etcd_process(env: &local_env::LocalEnv) {
if let Err(e) = etcd::stop_etcd_process(env) {
eprintln!("etcd stop failed: {e}");
}
}

View File

@@ -4,8 +4,12 @@ version = "0.1.0"
edition = "2021"
[features]
default = []
# It is simpler infra-wise to have failpoints enabled by default
# It shouldn't affect perf in any way because failpoints
# are not placed in hot code paths
default = ["failpoints"]
profiling = ["pprof"]
failpoints = ["fail/failpoints"]
[dependencies]
chrono = "0.4.19"
@@ -18,10 +22,9 @@ hex = "0.4.3"
hyper = "0.14"
itertools = "0.10.3"
lazy_static = "1.4.0"
clap = { version = "3.1.8", features = ["derive"] }
clap = "3.0"
daemonize = "0.4.1"
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-util = { version = "0.7", features = ["io"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
@@ -35,6 +38,7 @@ humantime = "2.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "1.12.0"
humantime-serde = "1.1.1"
pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
@@ -48,15 +52,16 @@ nix = "0.23"
once_cell = "1.8.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
rusoto_core = "0.47"
rusoto_s3 = "0.47"
async-trait = "0.1"
git-version = "0.3.5"
postgres_ffi = { path = "../libs/postgres_ffi" }
etcd_broker = { path = "../libs/etcd_broker" }
metrics = { path = "../libs/metrics" }
utils = { path = "../libs/utils" }
remote_storage = { path = "../libs/remote_storage" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
close_fds = "0.3.2"
walkdir = "2.3.2"
[dev-dependencies]
hex-literal = "0.3"

View File

@@ -135,7 +135,7 @@ The backup service is disabled by default and can be enabled to interact with a
CLI examples:
* Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"`
* AWS S3 : `${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/',access_key_id='SOMEKEYAAAAASADSAH*#',secret_access_key='SOMEsEcReTsd292v'}"`
* AWS S3 : `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"`
For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS.
For local S3 installations, refer to the their documentation for name format and credentials.
@@ -155,11 +155,9 @@ or
bucket_name = 'some-sample-bucket'
bucket_region = 'eu-north-1'
prefix_in_bucket = '/test_prefix/'
access_key_id = 'SOMEKEYAAAAASADSAH*#'
secret_access_key = 'SOMEsEcReTsd292v'
```
Also, `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` variables can be used to specify the credentials instead of any of the ways above.
`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
TODO: Sharding
--------------------

View File

@@ -10,7 +10,7 @@
//! This module is responsible for creation of such tarball
//! from data stored in object storage.
//!
use anyhow::{ensure, Context, Result};
use anyhow::{anyhow, ensure, Context, Result};
use bytes::{BufMut, BytesMut};
use std::fmt::Write as FmtWrite;
use std::io;
@@ -97,7 +97,9 @@ impl<'a> Basebackup<'a> {
})
}
pub fn send_tarball(&mut self) -> anyhow::Result<()> {
pub fn send_tarball(mut self) -> anyhow::Result<()> {
// TODO include checksum
// Create pgdata subdirs structure
for dir in pg_constants::PGDATA_SUBDIRS.iter() {
let header = new_tar_header_dir(*dir)?;
@@ -154,9 +156,17 @@ impl<'a> Basebackup<'a> {
let img = self
.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn)?;
ensure!(img.len() == pg_constants::BLCKSZ as usize);
slru_buf.extend_from_slice(&img);
if slru == SlruKind::Clog {
ensure!(
img.len() == pg_constants::BLCKSZ as usize
|| img.len() == pg_constants::BLCKSZ as usize + 8
);
} else {
ensure!(img.len() == pg_constants::BLCKSZ as usize);
}
slru_buf.extend_from_slice(&img[..pg_constants::BLCKSZ as usize]);
}
let segname = format!("{}/{:>04X}", slru.to_str(), segno);
@@ -315,7 +325,8 @@ impl<'a> Basebackup<'a> {
let wal_file_name = XLogFileName(PG_TLI, segno, pg_constants::WAL_SEGMENT_SIZE);
let wal_file_path = format!("pg_wal/{}", wal_file_name);
let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?;
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier);
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier)
.map_err(|e| anyhow!(e).context("Failed generating wal segment"))?;
ensure!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
self.ar.append(&header, &wal_seg[..])?;
Ok(())

View File

@@ -2,15 +2,17 @@
//!
//! A handy tool for debugging, that's all.
use anyhow::Result;
use clap::{Arg, Command};
use clap::{App, Arg};
use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::page_cache;
use pageserver::virtual_file;
use std::path::PathBuf;
use utils::GIT_VERSION;
use utils::project_git_version;
project_git_version!(GIT_VERSION);
fn main() -> Result<()> {
let arg_matches = Command::new("Zenith dump_layerfile utility")
let arg_matches = App::new("Zenith dump_layerfile utility")
.about("Dump contents of one layer file, for debugging")
.version(GIT_VERSION)
.arg(

View File

@@ -5,15 +5,13 @@ use tracing::*;
use anyhow::{bail, Context, Result};
use clap::{Arg, Command};
use clap::{App, Arg};
use daemonize::Daemonize;
use fail::FailScenario;
use pageserver::{
config::{defaults::*, PageServerConf},
http, page_cache, page_service, profiling,
remote_storage::{self, SyncStartupData},
repository::{Repository, TimelineSyncStatusUpdate},
tenant_mgr, thread_mgr,
http, page_cache, page_service, profiling, tenant_mgr, thread_mgr,
thread_mgr::ThreadKind,
timelines, virtual_file, LOG_FILE_NAME,
};
@@ -22,20 +20,25 @@ use utils::{
http::endpoint,
logging,
postgres_backend::AuthType,
project_git_version,
shutdown::exit_now,
signals::{self, Signal},
tcp_listener,
zid::{ZTenantId, ZTimelineId},
GIT_VERSION,
};
project_git_version!(GIT_VERSION);
fn version() -> String {
format!("{} profiling:{}", GIT_VERSION, cfg!(feature = "profiling"))
format!(
"{GIT_VERSION} profiling:{} failpoints:{}",
cfg!(feature = "profiling"),
fail::has_failpoints()
)
}
fn main() -> anyhow::Result<()> {
metrics::set_common_metrics_prefix("pageserver");
let arg_matches = Command::new("Zenith page server")
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(&*version())
.arg(
@@ -82,8 +85,25 @@ fn main() -> anyhow::Result<()> {
.help("Additional configuration overrides of the ones from the toml config file (or new ones to add there).
Any option has to be a valid toml document, example: `-c=\"foo='hey'\"` `-c=\"foo={value=1}\"`"),
)
.arg(
Arg::new("enabled-features")
.long("enabled-features")
.takes_value(false)
.help("Show enabled compile time features"),
)
.get_matches();
if arg_matches.is_present("enabled-features") {
let features: &[&str] = &[
#[cfg(feature = "failpoints")]
"failpoints",
#[cfg(feature = "profiling")]
"profiling",
];
println!("{{\"features\": {features:?} }}");
return Ok(());
}
let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".zenith"));
let workdir = workdir
.canonicalize()
@@ -164,6 +184,9 @@ fn main() -> anyhow::Result<()> {
// as a ref.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
// Initialize up failpoints support
let scenario = FailScenario::setup();
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
page_cache::init(conf.page_cache_size);
@@ -179,20 +202,19 @@ fn main() -> anyhow::Result<()> {
cfg_file_path.display()
)
})?;
Ok(())
} else {
start_pageserver(conf, daemonize).context("Failed to start pageserver")
start_pageserver(conf, daemonize).context("Failed to start pageserver")?;
}
scenario.teardown();
Ok(())
}
fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()> {
// Initialize logger
let log_file = logging::init(LOG_FILE_NAME, daemonize)?;
// Initialize wal metadata logger, if necessary
pageserver::wal_metadata::init(conf).expect("wal_metadata init failed");
info!("version: {}", GIT_VERSION);
info!("version: {GIT_VERSION}");
// TODO: Check that it looks like a valid repository before going further
@@ -232,53 +254,14 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
// Otherwise, the coverage data will be damaged.
match daemonize.exit_action(|| exit_now(0)).start() {
Ok(_) => info!("Success, daemonized"),
Err(err) => error!(%err, "could not daemonize"),
Err(err) => bail!("{err}. could not daemonize. bailing."),
}
}
let signals = signals::install_shutdown_handlers()?;
// Initialize repositories with locally available timelines.
// Timelines that are only partially available locally (remote storage has more data than this pageserver)
// are scheduled for download and added to the repository once download is completed.
let SyncStartupData {
remote_index,
local_timeline_init_statuses,
} = remote_storage::start_local_timeline_sync(conf)
.context("Failed to set up local files sync with external storage")?;
for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses {
// initialize local tenant
let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index)
.with_context(|| format!("Failed to load repo for tenant {}", tenant_id))?;
for (timeline_id, init_status) in local_timeline_init_statuses {
match init_status {
remote_storage::LocalTimelineInitStatus::LocallyComplete => {
debug!("timeline {} for tenant {} is locally complete, registering it in repository", timeline_id, tenant_id);
// Lets fail here loudly to be on the safe side.
// XXX: It may be a better api to actually distinguish between repository startup
// and processing of newly downloaded timelines.
repo.apply_timeline_remote_sync_status_update(
timeline_id,
TimelineSyncStatusUpdate::Downloaded,
)
.with_context(|| {
format!(
"Failed to bootstrap timeline {} for tenant {}",
timeline_id, tenant_id
)
})?
}
remote_storage::LocalTimelineInitStatus::NeedsSync => {
debug!(
"timeline {} for tenant {} needs sync, \
so skipped for adding into repository until sync is finished",
tenant_id, timeline_id
);
}
}
}
}
// start profiler (if enabled)
let profiler_guard = profiling::init_profiler(conf);
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
@@ -291,8 +274,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
};
info!("Using auth: {:#?}", conf.auth_type);
// start profiler (if enabled)
let profiler_guard = profiling::init_profiler(conf);
let remote_index = tenant_mgr::init_tenant_mgr(conf)?;
// Spawn a new thread for the http endpoint
// bind before launching separate thread so the error reported before startup exits
@@ -302,7 +284,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"http_endpoint_thread",
false,
true,
move || {
let router = http::make_router(conf, auth_cloned, remote_index)?;
endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher())
@@ -316,7 +298,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
None,
None,
"libpq endpoint thread",
false,
true,
move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type),
)?;
@@ -336,7 +318,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
signal.name()
);
profiling::exit_profiler(conf, &profiler_guard);
pageserver::shutdown_pageserver();
pageserver::shutdown_pageserver(0);
unreachable!()
}
})

View File

@@ -1,265 +0,0 @@
//! Pageserver benchmark tool
//!
//! This tool connects directly to a pageserver, issues queries and measures performance.
//!
//! Ideally the tool would be ablle to stream WAL into the pageserver, and (possibly
//! simultaneously) make read requests. Currently wal streaming is not implemented,
//! so this tool assumes the pageserver is prepopulated with some data, and only
//! issues read queries. It also currently assumes that the pageserver writes out some
//! metadata describing the write access pattern on the workload that was performed on it.
//! See the python tests that use psbench for usage example.
//!
//! This tool runs a variety of workloads. See the PsbenchTest enum below, or run the tool
//! with --help to see the available workloads.
//!
use bytes::{BufMut, BytesMut};
use clap::{Parser, Subcommand};
use futures::future;
use pageserver::wal_metadata::{Page, WalEntryMetadata};
use postgres_ffi::pg_constants::BLCKSZ;
use rand::seq::SliceRandom;
use rand::thread_rng;
use std::fs::File;
use std::path::{Path, PathBuf};
use std::time::Instant;
use std::{
collections::HashSet,
io::{BufRead, BufReader, Cursor},
time::Duration,
};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;
use utils::zid::{ZTenantId, ZTimelineId};
use utils::{
lsn::Lsn,
pq_proto::{BeMessage, FeMessage},
};
use anyhow::Result;
/// Client for the pageserver's pagestream API
struct PagestreamApi {
stream: TcpStream,
}
/// Good enough implementation for these tests
impl PagestreamApi {
async fn connect(tenant: &ZTenantId, timeline: &ZTimelineId) -> Result<PagestreamApi> {
let mut stream = TcpStream::connect("localhost:15000").await?;
// Connect to pageserver
// TODO read host, port, dbname, user from command line
let (client, conn) = tokio_postgres::Config::new()
.host("127.0.0.1")
.port(15000)
.dbname("postgres")
.user("zenith_admin")
.connect_raw(&mut stream, tokio_postgres::NoTls)
.await?;
// Enter pagestream protocol
let init_query = format!("pagestream {} {}", tenant, timeline);
tokio::select! {
_ = conn => panic!("connection closed during pagestream initialization"),
_ = client.query(init_query.as_str(), &[]) => (),
};
Ok(PagestreamApi { stream })
}
async fn get_page(&mut self, lsn: &Lsn, page: &Page, latest: bool) -> anyhow::Result<Vec<u8>> {
let latest: u8 = if latest { 1 } else { 0 };
let msg = {
let query = {
let mut query = BytesMut::new();
query.put_u8(2); // Specifies get_page query
query.put_u8(latest);
query.put_u64(lsn.0);
page.write(&mut query).await?;
query.freeze()
};
let mut buf = BytesMut::new();
let copy_msg = BeMessage::CopyData(&query);
BeMessage::write(&mut buf, &copy_msg)?;
buf.freeze()
};
self.stream.write_all(&msg).await?;
let response = match FeMessage::read_fut(&mut self.stream).await? {
Some(FeMessage::CopyData(page)) => page,
r => panic!("Expected CopyData message, got: {:?}", r),
};
let page = {
let mut cursor = Cursor::new(response);
let tag = AsyncReadExt::read_u8(&mut cursor).await?;
match tag {
102 => {
let mut page = Vec::<u8>::new();
cursor.read_to_end(&mut page).await?;
if page.len() != (BLCKSZ as usize) {
panic!("Expected 8kb page, got: {:?}", page.len());
}
page
}
103 => {
let mut bytes = Vec::<u8>::new();
cursor.read_to_end(&mut bytes).await?;
let message = String::from_utf8(bytes)?;
panic!("Got error message: {}", message);
}
_ => panic!("Unhandled tag {:?}", tag),
}
};
Ok(page)
}
}
/// Parsed wal_metadata file with additional derived
/// statistics for convenience.
#[derive(Clone)]
struct Metadata {
// Parsed from metadata file
wal_metadata: Vec<WalEntryMetadata>,
// Derived from wal_metadata
total_wal_size: usize,
affected_pages: HashSet<Page>,
latest_lsn: Lsn,
}
impl Metadata {
/// Construct metadata object from wal_metadata file emitted by pageserver
fn build(wal_metadata_path: &Path) -> Result<Metadata> {
let wal_metadata_file = File::open(wal_metadata_path).expect("error opening wal_metadata");
let wal_metadata: Vec<WalEntryMetadata> = BufReader::new(wal_metadata_file)
.lines()
.map(|result| result.expect("error reading from file"))
.map(|line| serde_json::from_str(&line).expect("corrupt metadata file"))
.collect();
let total_wal_size: usize = wal_metadata.iter().map(|m| m.size).sum();
let affected_pages: HashSet<_> = wal_metadata
.iter()
.flat_map(|m| m.affected_pages.clone())
.collect();
let latest_lsn = wal_metadata.iter().map(|m| m.lsn).max().unwrap();
Ok(Metadata {
wal_metadata,
total_wal_size,
affected_pages,
latest_lsn,
})
}
/// Print results in a format readable by benchmark_fixture.py
fn report_latency(&self, latencies: &[Duration]) -> Result<()> {
let mut latencies: Vec<&Duration> = latencies.iter().collect();
latencies.sort();
println!("test_param num_pages {}", self.affected_pages.len());
println!("test_param num_wal_entries {}", self.wal_metadata.len());
println!("test_param total_wal_size {} bytes", self.total_wal_size);
println!(
"lower_is_better fastest {:?} microseconds",
latencies.first().unwrap().as_micros()
);
println!(
"lower_is_better median {:?} microseconds",
latencies[latencies.len() / 2].as_micros()
);
println!(
"lower_is_better average {:.2} microseconds",
(latencies.iter().map(|l| l.as_micros()).sum::<u128>() as f64)
/ (latencies.len() as f64)
);
println!(
"lower_is_better p99 {:?} microseconds",
latencies[latencies.len() - 1 - latencies.len() / 100].as_micros()
);
println!(
"lower_is_better slowest {:?} microseconds",
latencies.last().unwrap().as_micros()
);
Ok(())
}
}
/// Sequentially get the latest version of each page and report latencies
async fn test_latest_pages(
tenant: &ZTenantId,
timeline: &ZTimelineId,
metadata: &Metadata,
) -> Result<Vec<Duration>> {
let mut api = PagestreamApi::connect(tenant, timeline).await.unwrap();
let mut latencies: Vec<Duration> = vec![];
let mut page_order: Vec<&Page> = metadata.affected_pages.iter().collect();
page_order.shuffle(&mut thread_rng());
for page in page_order {
let start = Instant::now();
let _page_bytes = api.get_page(&metadata.latest_lsn, page, true).await?;
let duration = start.elapsed();
latencies.push(duration);
}
Ok(latencies)
}
#[derive(Parser, Debug, Clone)]
struct Args {
// TODO maybe one metadata file per timeline?
wal_metadata_path: PathBuf,
// TODO get these from wal metadata
tenant: ZTenantId,
timeline: ZTimelineId,
// TODO change to `clients_per_timeline`
#[clap(long, default_value = "1")]
num_clients: usize,
#[clap(subcommand)]
test: PsbenchTest,
}
#[derive(Subcommand, Debug, Clone)]
enum PsbenchTest {
/// Query the latest version of each page, in a random sequential order.
/// If multiple clients are used, all clients will independently query
/// every page in a different random order.
GetLatestPages,
// TODO add more tests:
// - Query with realistic read pattern
// - Query every page after every change of that page
// - Query all pages at given point in time
// - etc.
}
#[tokio::main]
async fn main() -> Result<()> {
let args = Args::parse();
let metadata = Metadata::build(&args.wal_metadata_path).unwrap();
let latencies: Vec<Duration> = match args.test {
PsbenchTest::GetLatestPages => {
// TODO explicitly spawn a thread for each?
future::join_all(
(0..args.num_clients)
.map(|_| test_latest_pages(&args.tenant, &args.timeline, &metadata)),
)
.await
.into_iter()
.flat_map(|v| v.unwrap())
.collect()
}
};
println!("test_param num_clients {}", args.num_clients);
metadata.report_latency(&latencies).unwrap();
Ok(())
}

View File

@@ -2,14 +2,16 @@
//!
//! A handy tool for debugging, that's all.
use anyhow::Result;
use clap::{Arg, Command};
use clap::{App, Arg};
use pageserver::layered_repository::metadata::TimelineMetadata;
use std::path::PathBuf;
use std::str::FromStr;
use utils::{lsn::Lsn, GIT_VERSION};
use utils::{lsn::Lsn, project_git_version};
project_git_version!(GIT_VERSION);
fn main() -> Result<()> {
let arg_matches = Command::new("Zenith update metadata utility")
let arg_matches = App::new("Zenith update metadata utility")
.about("Dump or update metadata file")
.version(GIT_VERSION)
.arg(

View File

@@ -5,6 +5,7 @@
//! See also `settings.md` for better description on every parameter.
use anyhow::{anyhow, bail, ensure, Context, Result};
use remote_storage::{RemoteStorageConfig, RemoteStorageKind, S3Config};
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::{Path, PathBuf};
@@ -12,6 +13,7 @@ use std::str::FromStr;
use std::time::Duration;
use toml_edit;
use toml_edit::{Document, Item};
use url::Url;
use utils::{
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTimelineId},
@@ -33,18 +35,6 @@ pub mod defaults {
pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed.
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC: usize = 50;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
/// Currently, sync happens with AWS S3, that has two limits on requests per second:
/// ~200 RPS for IAM services
/// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html
/// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests
/// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
@@ -75,6 +65,7 @@ pub mod defaults {
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#image_creation_threshold = {DEFAULT_IMAGE_CREATION_THRESHOLD}
#pitr_interval = '{DEFAULT_PITR_INTERVAL}'
# [remote_storage]
@@ -119,9 +110,15 @@ pub struct PageServerConf {
pub auth_validation_public_key_path: Option<PathBuf>,
pub remote_storage_config: Option<RemoteStorageConfig>,
pub emit_wal_metadata: bool,
pub profiling: ProfilingConfig,
pub default_tenant_conf: TenantConf,
/// A prefix to add in etcd brokers before every key.
/// Can be used for isolating different pageserver groups withing the same etcd cluster.
pub broker_etcd_prefix: String,
/// Etcd broker endpoints to connect to.
pub broker_endpoints: Vec<Url>,
}
#[derive(Debug, Clone, PartialEq, Eq)]
@@ -185,8 +182,9 @@ struct PageServerConfigBuilder {
id: BuilderValue<ZNodeId>,
emit_wal_metadata: BuilderValue<bool>,
profiling: BuilderValue<ProfilingConfig>,
broker_etcd_prefix: BuilderValue<String>,
broker_endpoints: BuilderValue<Vec<Url>>,
}
impl Default for PageServerConfigBuilder {
@@ -211,8 +209,9 @@ impl Default for PageServerConfigBuilder {
auth_validation_public_key_path: Set(None),
remote_storage_config: Set(None),
id: NotSet,
emit_wal_metadata: Set(false),
profiling: Set(ProfilingConfig::Disabled),
broker_etcd_prefix: Set(etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string()),
broker_endpoints: Set(Vec::new()),
}
}
}
@@ -269,19 +268,27 @@ impl PageServerConfigBuilder {
self.remote_storage_config = BuilderValue::Set(remote_storage_config)
}
pub fn id(&mut self, node_id: ZNodeId) {
self.id = BuilderValue::Set(node_id)
pub fn broker_endpoints(&mut self, broker_endpoints: Vec<Url>) {
self.broker_endpoints = BuilderValue::Set(broker_endpoints)
}
pub fn emit_wal_metadata(&mut self, value: bool) {
self.emit_wal_metadata = BuilderValue::Set(value)
pub fn broker_etcd_prefix(&mut self, broker_etcd_prefix: String) {
self.broker_etcd_prefix = BuilderValue::Set(broker_etcd_prefix)
}
pub fn id(&mut self, node_id: ZNodeId) {
self.id = BuilderValue::Set(node_id)
}
pub fn profiling(&mut self, profiling: ProfilingConfig) {
self.profiling = BuilderValue::Set(profiling)
}
pub fn build(self) -> Result<PageServerConf> {
pub fn build(self) -> anyhow::Result<PageServerConf> {
let broker_endpoints = self
.broker_endpoints
.ok_or(anyhow!("No broker endpoints provided"))?;
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
@@ -314,77 +321,17 @@ impl PageServerConfigBuilder {
.remote_storage_config
.ok_or(anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow!("missing id"))?,
emit_wal_metadata: self
.emit_wal_metadata
.ok_or(anyhow!("emit_wal_metadata not specifiec"))?,
profiling: self.profiling.ok_or(anyhow!("missing profiling"))?,
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
broker_endpoints,
broker_etcd_prefix: self
.broker_etcd_prefix
.ok_or(anyhow!("missing broker_etcd_prefix"))?,
})
}
}
/// External backup storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between pageserver and the remote storage.
pub max_concurrent_timelines_sync: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
pub storage: RemoteStorageKind,
}
/// A kind of a remote storage to connect to, with its connection configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored files into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all files in the S3 bucket
/// specified by the config
AwsS3(S3Config),
}
/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write).
#[derive(Clone, PartialEq, Eq)]
pub struct S3Config {
/// Name of the bucket to connect to.
pub bucket_name: String,
/// The region where the bucket is located at.
pub bucket_region: String,
/// A "subfolder" in the bucket, to use the same bucket separately by multiple pageservers at once.
pub prefix_in_bucket: Option<String>,
/// "Login" to use when connecting to bucket.
/// Can be empty for cases like AWS k8s IAM
/// where we can allow certain pods to connect
/// to the bucket directly without any credentials.
pub access_key_id: Option<String>,
/// "Password" to use when connecting to bucket.
pub secret_access_key: Option<String>,
/// A base URL to send S3 requests to.
/// By default, the endpoint is derived from a region name, assuming it's
/// an AWS S3 region name, erroring on wrong region name.
/// Endpoint provides a way to support other S3 flavors and their regions.
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,
/// AWS S3 has various limits on its API calls, we need not to exceed those.
/// See [`defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details.
pub concurrency_limit: NonZeroUsize,
}
impl std::fmt::Debug for S3Config {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("S3Config")
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.field("prefix_in_bucket", &self.prefix_in_bucket)
.field("concurrency_limit", &self.concurrency_limit)
.finish()
}
}
impl PageServerConf {
//
// Repository paths, relative to workdir.
@@ -422,7 +369,7 @@ impl PageServerConf {
/// validating the input and failing on errors.
///
/// This leaves any options not present in the file in the built-in defaults.
pub fn parse_and_validate(toml: &Document, workdir: &Path) -> Result<Self> {
pub fn parse_and_validate(toml: &Document, workdir: &Path) -> anyhow::Result<Self> {
let mut builder = PageServerConfigBuilder::default();
builder.workdir(workdir.to_owned());
@@ -449,12 +396,22 @@ impl PageServerConf {
"remote_storage" => {
builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?))
}
"tenant_conf" => {
"tenant_config" => {
t_conf = Self::parse_toml_tenant_conf(item)?;
}
"id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)),
"emit_wal_metadata" => builder.emit_wal_metadata(true),
"profiling" => builder.profiling(parse_toml_from_str(key, item)?),
"broker_etcd_prefix" => builder.broker_etcd_prefix(parse_toml_string(key, item)?),
"broker_endpoints" => builder.broker_endpoints(
parse_toml_array(key, item)?
.into_iter()
.map(|endpoint_str| {
endpoint_str.parse::<Url>().with_context(|| {
format!("Array item {endpoint_str} for key {key} is not a valid url endpoint")
})
})
.collect::<anyhow::Result<_>>()?,
),
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -533,21 +490,21 @@ impl PageServerConf {
let bucket_name = toml.get("bucket_name");
let bucket_region = toml.get("bucket_region");
let max_concurrent_timelines_sync = NonZeroUsize::new(
parse_optional_integer("max_concurrent_timelines_sync", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC),
let max_concurrent_syncs = NonZeroUsize::new(
parse_optional_integer("max_concurrent_syncs", toml)?
.unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS),
)
.context("Failed to parse 'max_concurrent_timelines_sync' as a positive integer")?;
.context("Failed to parse 'max_concurrent_syncs' as a positive integer")?;
let max_sync_errors = NonZeroU32::new(
parse_optional_integer("max_sync_errors", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
.unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
)
.context("Failed to parse 'max_sync_errors' as a positive integer")?;
let concurrency_limit = NonZeroUsize::new(
parse_optional_integer("concurrency_limit", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT),
.unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT),
)
.context("Failed to parse 'concurrency_limit' as a positive integer")?;
@@ -562,16 +519,6 @@ impl PageServerConf {
(None, Some(bucket_name), Some(bucket_region)) => RemoteStorageKind::AwsS3(S3Config {
bucket_name: parse_toml_string("bucket_name", bucket_name)?,
bucket_region: parse_toml_string("bucket_region", bucket_region)?,
access_key_id: toml
.get("access_key_id")
.map(|access_key_id| parse_toml_string("access_key_id", access_key_id))
.transpose()?,
secret_access_key: toml
.get("secret_access_key")
.map(|secret_access_key| {
parse_toml_string("secret_access_key", secret_access_key)
})
.transpose()?,
prefix_in_bucket: toml
.get("prefix_in_bucket")
.map(|prefix_in_bucket| parse_toml_string("prefix_in_bucket", prefix_in_bucket))
@@ -589,7 +536,7 @@ impl PageServerConf {
};
Ok(RemoteStorageConfig {
max_concurrent_timelines_sync,
max_concurrent_syncs,
max_sync_errors,
storage,
})
@@ -616,9 +563,10 @@ impl PageServerConf {
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
emit_wal_metadata: false,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::dummy_conf(),
broker_endpoints: Vec::new(),
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
}
}
}
@@ -669,14 +617,36 @@ fn parse_toml_duration(name: &str, item: &Item) -> Result<Duration> {
Ok(humantime::parse_duration(s)?)
}
fn parse_toml_from_str<T>(name: &str, item: &Item) -> Result<T>
fn parse_toml_from_str<T>(name: &str, item: &Item) -> anyhow::Result<T>
where
T: FromStr<Err = anyhow::Error>,
T: FromStr,
<T as FromStr>::Err: std::fmt::Display,
{
let v = item
.as_str()
.with_context(|| format!("configure option {name} is not a string"))?;
T::from_str(v)
T::from_str(v).map_err(|e| {
anyhow!(
"Failed to parse string as {parse_type} for configure option {name}: {e}",
parse_type = stringify!(T)
)
})
}
fn parse_toml_array(name: &str, item: &Item) -> anyhow::Result<Vec<String>> {
let array = item
.as_array()
.with_context(|| format!("configure option {name} is not an array"))?;
array
.iter()
.map(|value| {
value
.as_str()
.map(str::to_string)
.with_context(|| format!("Array item {value:?} for key {name} is not a string"))
})
.collect()
}
#[cfg(test)]
@@ -709,12 +679,16 @@ id = 10
fn parse_defaults() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
// we have to create dummy pathes to overcome the validation errors
let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display());
let broker_endpoint = "http://127.0.0.1:7777";
// we have to create dummy values to overcome the validation errors
let config_string = format!(
"pg_distrib_dir='{}'\nid=10\nbroker_endpoints = ['{broker_endpoint}']",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}"));
assert_eq!(
parsed_config,
@@ -732,9 +706,12 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
emit_wal_metadata: false,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
broker_endpoints: vec![broker_endpoint
.parse()
.expect("Failed to parse a valid broker endpoint URL")],
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
},
"Correct defaults should be used when no config values are provided"
);
@@ -746,15 +723,16 @@ id = 10
fn parse_basic_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let broker_endpoint = "http://127.0.0.1:7777";
let config_string = format!(
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'",
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'\nbroker_endpoints = ['{broker_endpoint}']",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}"));
assert_eq!(
parsed_config,
@@ -772,9 +750,12 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
emit_wal_metadata: false,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
broker_endpoints: vec![broker_endpoint
.parse()
.expect("Failed to parse a valid broker endpoint URL")],
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
},
"Should be able to parse all basic config values correctly"
);
@@ -786,6 +767,7 @@ id = 10
fn parse_remote_fs_storage_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let broker_endpoint = "http://127.0.0.1:7777";
let local_storage_path = tempdir.path().join("local_remote_storage");
@@ -805,6 +787,7 @@ local_path = '{}'"#,
let config_string = format!(
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
broker_endpoints = ['{broker_endpoint}']
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
@@ -813,18 +796,20 @@ pg_distrib_dir='{}'
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.unwrap_or_else(|e| {
panic!("Failed to parse config '{config_string}', reason: {e:?}")
})
.remote_storage_config
.expect("Should have remote storage config for the local FS");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_timelines_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC
max_concurrent_syncs: NonZeroUsize::new(
remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS
)
.unwrap(),
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
max_sync_errors: NonZeroU32::new(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
@@ -842,29 +827,26 @@ pg_distrib_dir='{}'
let bucket_name = "some-sample-bucket".to_string();
let bucket_region = "eu-north-1".to_string();
let prefix_in_bucket = "test_prefix".to_string();
let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string();
let secret_access_key = "SOMEsEcReTsd292v".to_string();
let endpoint = "http://localhost:5000".to_string();
let max_concurrent_timelines_sync = NonZeroUsize::new(111).unwrap();
let max_concurrent_syncs = NonZeroUsize::new(111).unwrap();
let max_sync_errors = NonZeroU32::new(222).unwrap();
let s3_concurrency_limit = NonZeroUsize::new(333).unwrap();
let broker_endpoint = "http://127.0.0.1:7777";
let identical_toml_declarations = &[
format!(
r#"[remote_storage]
max_concurrent_timelines_sync = {max_concurrent_timelines_sync}
max_concurrent_syncs = {max_concurrent_syncs}
max_sync_errors = {max_sync_errors}
bucket_name = '{bucket_name}'
bucket_region = '{bucket_region}'
prefix_in_bucket = '{prefix_in_bucket}'
access_key_id = '{access_key_id}'
secret_access_key = '{secret_access_key}'
endpoint = '{endpoint}'
concurrency_limit = {s3_concurrency_limit}"#
),
format!(
"remote_storage={{max_concurrent_timelines_sync={max_concurrent_timelines_sync}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\
bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', access_key_id='{access_key_id}', secret_access_key='{secret_access_key}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}",
"remote_storage={{max_concurrent_syncs={max_concurrent_syncs}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\
bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}",
),
];
@@ -872,6 +854,7 @@ concurrency_limit = {s3_concurrency_limit}"#
let config_string = format!(
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
broker_endpoints = ['{broker_endpoint}']
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
@@ -880,20 +863,20 @@ pg_distrib_dir='{}'
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.unwrap_or_else(|e| {
panic!("Failed to parse config '{config_string}', reason: {e:?}")
})
.remote_storage_config
.expect("Should have remote storage config for S3");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_timelines_sync,
max_concurrent_syncs,
max_sync_errors,
storage: RemoteStorageKind::AwsS3(S3Config {
bucket_name: bucket_name.clone(),
bucket_region: bucket_region.clone(),
access_key_id: Some(access_key_id.clone()),
secret_access_key: Some(secret_access_key.clone()),
prefix_in_bucket: Some(prefix_in_bucket.clone()),
endpoint: Some(endpoint.clone()),
concurrency_limit: s3_concurrency_limit,

View File

@@ -31,6 +31,7 @@ pub struct TenantCreateRequest {
pub compaction_threshold: Option<usize>,
pub gc_horizon: Option<u64>,
pub gc_period: Option<String>,
pub image_creation_threshold: Option<usize>,
pub pitr_interval: Option<String>,
}
@@ -65,6 +66,7 @@ pub struct TenantConfigRequest {
pub compaction_threshold: Option<usize>,
pub gc_horizon: Option<u64>,
pub gc_period: Option<String>,
pub image_creation_threshold: Option<usize>,
pub pitr_interval: Option<String>,
}
@@ -78,6 +80,7 @@ impl TenantConfigRequest {
compaction_threshold: None,
gc_horizon: None,
gc_period: None,
image_creation_threshold: None,
pitr_interval: None,
}
}

View File

@@ -123,6 +123,53 @@ paths:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/wal_receiver:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: timeline_id
in: path
required: true
schema:
type: string
format: hex
get:
description: Get wal receiver's data attached to the timeline
responses:
"200":
description: WalReceiverEntry
content:
application/json:
schema:
$ref: "#/components/schemas/WalReceiverEntry"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"404":
description: Error when no wal receiver is running or found
content:
application/json:
schema:
$ref: "#/components/schemas/NotFoundError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/attach:
parameters:
@@ -520,6 +567,21 @@ components:
type: integer
current_logical_size_non_incremental:
type: integer
WalReceiverEntry:
type: object
required:
- thread_id
- wal_producer_connstr
properties:
thread_id:
type: integer
wal_producer_connstr:
type: string
last_received_msg_lsn:
type: string
format: hex
last_received_msg_ts:
type: integer
Error:
type: object

View File

@@ -3,17 +3,16 @@ use std::sync::Arc;
use anyhow::{Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use remote_storage::GenericRemoteStorage;
use tracing::*;
use super::models::{
StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse,
TimelineCreateRequest,
};
use crate::config::RemoteStorageKind;
use crate::remote_storage::{
download_index_part, schedule_timeline_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket,
};
use crate::repository::Repository;
use crate::storage_sync;
use crate::storage_sync::index::{RemoteIndex, RemoteTimeline};
use crate::tenant_config::TenantConfOpt;
use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo};
use crate::{config::PageServerConf, tenant_mgr, timelines};
@@ -37,11 +36,6 @@ struct State {
remote_storage: Option<GenericRemoteStorage>,
}
enum GenericRemoteStorage {
Local(LocalFs),
S3(S3Bucket),
}
impl State {
fn new(
conf: &'static PageServerConf,
@@ -57,14 +51,7 @@ impl State {
let remote_storage = conf
.remote_storage_config
.as_ref()
.map(|storage_config| match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => {
LocalFs::new(root.clone(), &conf.workdir).map(GenericRemoteStorage::Local)
}
RemoteStorageKind::AwsS3(s3_config) => {
S3Bucket::new(s3_config, &conf.workdir).map(GenericRemoteStorage::S3)
}
})
.map(|storage_config| GenericRemoteStorage::new(conf.workdir.clone(), storage_config))
.transpose()
.context("Failed to init generic remote storage")?;
@@ -179,43 +166,47 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let span = info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id);
let (local_timeline_info, remote_timeline_info) = async {
// any error here will render local timeline as None
// XXX .in_current_span does not attach messages in spawn_blocking future to current future's span
let local_timeline_info = tokio::task::spawn_blocking(move || {
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
let local_timeline = {
repo.get_timeline(timeline_id)
.as_ref()
.map(|timeline| {
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline,
include_non_incremental_logical_size,
)
})
.transpose()?
};
Ok::<_, anyhow::Error>(local_timeline)
})
.await
.ok()
.and_then(|r| r.ok())
.flatten();
let (local_timeline_info, span) = tokio::task::spawn_blocking(move || {
let entered = span.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
let local_timeline = {
repo.get_timeline(timeline_id)
.as_ref()
.map(|timeline| {
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline,
include_non_incremental_logical_size,
)
let remote_timeline_info = {
let remote_index_read = get_state(&request).remote_index.read().await;
remote_index_read
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(),
awaits_download: remote_entry.awaits_download,
})
.transpose()?
};
Ok::<_, anyhow::Error>((local_timeline, entered.exit()))
})
.await
.map_err(ApiError::from_err)??;
let remote_timeline_info = {
let remote_index_read = get_state(&request).remote_index.read().await;
remote_index_read
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(),
awaits_download: remote_entry.awaits_download,
})
};
let _enter = span.entered();
(local_timeline_info, remote_timeline_info)
}
.instrument(info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id))
.await;
if local_timeline_info.is_none() && remote_timeline_info.is_none() {
return Err(ApiError::NotFound(
@@ -233,6 +224,30 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
json_response(StatusCode::OK, timeline_info)
}
async fn wal_receiver_get_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let wal_receiver = tokio::task::spawn_blocking(move || {
let _enter =
info_span!("wal_receiver_get", tenant = %tenant_id, timeline = %timeline_id).entered();
crate::walreceiver::get_wal_receiver_entry(tenant_id, timeline_id)
})
.await
.map_err(ApiError::from_err)?
.ok_or_else(|| {
ApiError::NotFound(format!(
"WAL receiver not found for tenant {} and timeline {}",
tenant_id, timeline_id
))
})?;
json_response(StatusCode::OK, wal_receiver)
}
async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
@@ -244,7 +259,7 @@ async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body
);
tokio::task::spawn_blocking(move || {
if tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).is_ok() {
if tenant_mgr::get_local_timeline_with_load(tenant_id, timeline_id).is_ok() {
// TODO: maybe answer with 309 Not Modified here?
anyhow::bail!("Timeline is already present locally")
};
@@ -269,14 +284,14 @@ async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body
}
remote_timeline.awaits_download = true;
schedule_timeline_download(tenant_id, timeline_id);
storage_sync::schedule_layer_download(tenant_id, timeline_id);
return json_response(StatusCode::ACCEPTED, ());
} else {
// no timeline in the index, release the lock to make the potentially lengthy download opetation
drop(index_accessor);
}
let new_timeline = match try_download_shard_data(state, sync_id).await {
let new_timeline = match try_download_index_part_data(state, sync_id).await {
Ok(Some(mut new_timeline)) => {
tokio::fs::create_dir_all(state.conf.timeline_path(&timeline_id, &tenant_id))
.await
@@ -305,35 +320,32 @@ async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body
}
None => index_accessor.add_timeline_entry(sync_id, new_timeline),
}
schedule_timeline_download(tenant_id, timeline_id);
storage_sync::schedule_layer_download(tenant_id, timeline_id);
json_response(StatusCode::ACCEPTED, ())
}
async fn try_download_shard_data(
async fn try_download_index_part_data(
state: &State,
sync_id: ZTenantTimelineId,
) -> anyhow::Result<Option<RemoteTimeline>> {
let shard = match state.remote_storage.as_ref() {
let index_part = match state.remote_storage.as_ref() {
Some(GenericRemoteStorage::Local(local_storage)) => {
download_index_part(state.conf, local_storage, sync_id).await
storage_sync::download_index_part(state.conf, local_storage, sync_id).await
}
Some(GenericRemoteStorage::S3(s3_storage)) => {
download_index_part(state.conf, s3_storage, sync_id).await
storage_sync::download_index_part(state.conf, s3_storage, sync_id).await
}
None => return Ok(None),
}
.with_context(|| format!("Failed to download index shard for timeline {}", sync_id))?;
.with_context(|| format!("Failed to download index part for timeline {sync_id}"))?;
let timeline_path = state
.conf
.timeline_path(&sync_id.timeline_id, &sync_id.tenant_id);
RemoteTimeline::from_index_part(&timeline_path, shard)
RemoteTimeline::from_index_part(&timeline_path, index_part)
.map(Some)
.with_context(|| {
format!(
"Failed to convert index shard into remote timeline for timeline {}",
sync_id
)
format!("Failed to convert index part into remote timeline for timeline {sync_id}")
})
}
@@ -347,8 +359,8 @@ async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body
let _enter =
info_span!("timeline_detach_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
repo.detach_timeline(timeline_id)
let state = get_state(&request);
tenant_mgr::detach_timeline(state.conf, tenant_id, timeline_id)
})
.await
.map_err(ApiError::from_err)??;
@@ -365,7 +377,7 @@ async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, A
crate::tenant_mgr::list_tenants()
})
.await
.map_err(ApiError::from_err)??;
.map_err(ApiError::from_err)?;
json_response(StatusCode::OK, response_data)
}
@@ -377,12 +389,13 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
let request_data: TenantCreateRequest = json_request(&mut request).await?;
let remote_index = get_state(&request).remote_index.clone();
let mut tenant_conf: TenantConfOpt = Default::default();
let mut tenant_conf = TenantConfOpt::default();
if let Some(gc_period) = request_data.gc_period {
tenant_conf.gc_period =
Some(humantime::parse_duration(&gc_period).map_err(ApiError::from_err)?);
}
tenant_conf.gc_horizon = request_data.gc_horizon;
tenant_conf.image_creation_threshold = request_data.image_creation_threshold;
if let Some(pitr_interval) = request_data.pitr_interval {
tenant_conf.pitr_interval =
@@ -430,6 +443,7 @@ async fn tenant_config_handler(mut request: Request<Body>) -> Result<Response<Bo
Some(humantime::parse_duration(&gc_period).map_err(ApiError::from_err)?);
}
tenant_conf.gc_horizon = request_data.gc_horizon;
tenant_conf.image_creation_threshold = request_data.image_creation_threshold;
if let Some(pitr_interval) = request_data.pitr_interval {
tenant_conf.pitr_interval =
@@ -495,6 +509,10 @@ pub fn make_router(
"/v1/tenant/:tenant_id/timeline/:timeline_id",
timeline_detail_handler,
)
.get(
"/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver",
wal_receiver_get_handler,
)
.post(
"/v1/tenant/:tenant_id/timeline/:timeline_id/attach",
timeline_attach_handler,

View File

@@ -2,7 +2,6 @@
//! Import data and WAL from a PostgreSQL data directory and WAL segments into
//! a zenith Timeline.
//!
use std::fs;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};
@@ -10,16 +9,18 @@ use std::path::{Path, PathBuf};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use tracing::*;
use walkdir::WalkDir;
use crate::pgdatadir_mapping::*;
use crate::reltag::{RelTag, SlruKind};
use crate::repository::Repository;
use crate::repository::Timeline;
use crate::walingest::WalIngest;
use postgres_ffi::relfile_utils::*;
use postgres_ffi::waldecoder::*;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::Oid;
use postgres_ffi::{pg_constants, ControlFileData, DBState_DB_SHUTDOWNED};
use postgres_ffi::{Oid, TransactionId};
use utils::lsn::Lsn;
///
@@ -35,100 +36,30 @@ pub fn import_timeline_from_postgres_datadir<R: Repository>(
) -> Result<()> {
let mut pg_control: Option<ControlFileData> = None;
// TODO this shoud be start_lsn, which is not necessarily equal to end_lsn (aka lsn)
// Then fishing out pg_control would be unnecessary
let mut modification = tline.begin_modification(lsn);
modification.init_empty()?;
// Scan 'global'
let mut relfiles: Vec<PathBuf> = Vec::new();
for direntry in fs::read_dir(path.join("global"))? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
// Import all but pg_wal
let all_but_wal = WalkDir::new(path)
.into_iter()
.filter_entry(|entry| !entry.path().ends_with("pg_wal"));
for entry in all_but_wal {
let entry = entry?;
let metadata = entry.metadata().expect("error getting dir entry metadata");
if metadata.is_file() {
let absolute_path = entry.path();
let relative_path = absolute_path.strip_prefix(path)?;
Some("pg_control") => {
pg_control = Some(import_control_file(&mut modification, &direntry.path())?);
let file = File::open(absolute_path)?;
let len = metadata.len() as usize;
if let Some(control_file) = import_file(&mut modification, relative_path, file, len)? {
pg_control = Some(control_file);
}
Some("pg_filenode.map") => {
import_relmap_file(
&mut modification,
pg_constants::GLOBALTABLESPACE_OID,
0,
&direntry.path(),
)?;
}
// Load any relation files into the page server (but only after the other files)
_ => relfiles.push(direntry.path()),
modification.flush()?;
}
}
for relfile in relfiles {
import_relfile(
&mut modification,
&relfile,
pg_constants::GLOBALTABLESPACE_OID,
0,
)?;
}
// Scan 'base'. It contains database dirs, the database OID is the filename.
// E.g. 'base/12345', where 12345 is the database OID.
for direntry in fs::read_dir(path.join("base"))? {
let direntry = direntry?;
//skip all temporary files
if direntry.file_name().to_string_lossy() == "pgsql_tmp" {
continue;
}
let dboid = direntry.file_name().to_string_lossy().parse::<u32>()?;
let mut relfiles: Vec<PathBuf> = Vec::new();
for direntry in fs::read_dir(direntry.path())? {
let direntry = direntry?;
match direntry.file_name().to_str() {
None => continue,
Some("PG_VERSION") => {
//modification.put_dbdir_creation(pg_constants::DEFAULTTABLESPACE_OID, dboid)?;
}
Some("pg_filenode.map") => import_relmap_file(
&mut modification,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
&direntry.path(),
)?,
// Load any relation files into the page server
_ => relfiles.push(direntry.path()),
}
}
for relfile in relfiles {
import_relfile(
&mut modification,
&relfile,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
)?;
}
}
for entry in fs::read_dir(path.join("pg_xact"))? {
let entry = entry?;
import_slru_file(&mut modification, SlruKind::Clog, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("members"))? {
let entry = entry?;
import_slru_file(&mut modification, SlruKind::MultiXactMembers, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("offsets"))? {
let entry = entry?;
import_slru_file(&mut modification, SlruKind::MultiXactOffsets, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_twophase"))? {
let entry = entry?;
let xid = u32::from_str_radix(&entry.path().to_string_lossy(), 16)?;
import_twophase_file(&mut modification, xid, &entry.path())?;
}
// TODO: Scan pg_tblspc
// We're done importing all the data files.
modification.commit()?;
@@ -158,31 +89,30 @@ pub fn import_timeline_from_postgres_datadir<R: Repository>(
}
// subroutine of import_timeline_from_postgres_datadir(), to load one relation file.
fn import_relfile<R: Repository>(
fn import_rel<R: Repository, Reader: Read>(
modification: &mut DatadirModification<R>,
path: &Path,
spcoid: Oid,
dboid: Oid,
mut reader: Reader,
len: usize,
) -> anyhow::Result<()> {
// Does it look like a relation file?
trace!("importing rel file {}", path.display());
let (relnode, forknum, segno) = parse_relfilename(&path.file_name().unwrap().to_string_lossy())
.map_err(|e| {
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
e
})?;
let filename = &path
.file_name()
.expect("missing rel filename")
.to_string_lossy();
let (relnode, forknum, segno) = parse_relfilename(filename).map_err(|e| {
warn!("unrecognized file in postgres datadir: {:?} ({})", path, e);
e
})?;
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let len = file.metadata().unwrap().len();
ensure!(len % pg_constants::BLCKSZ as u64 == 0);
let nblocks = len / pg_constants::BLCKSZ as u64;
if segno != 0 {
todo!();
}
ensure!(len % pg_constants::BLCKSZ as usize == 0);
let nblocks = len / pg_constants::BLCKSZ as usize;
let rel = RelTag {
spcnode: spcoid,
@@ -190,11 +120,22 @@ fn import_relfile<R: Repository>(
relnode,
forknum,
};
modification.put_rel_creation(rel, nblocks as u32)?;
let mut blknum: u32 = segno * (1024 * 1024 * 1024 / pg_constants::BLCKSZ as u32);
// Call put_rel_creation for every segment of the relation,
// because there is no guarantee about the order in which we are processing segments.
// ignore "relation already exists" error
if let Err(e) = modification.put_rel_creation(rel, nblocks as u32) {
if e.to_string().contains("already exists") {
debug!("relation {} already exists. we must be extending it", rel);
} else {
return Err(e);
}
}
loop {
let r = file.read_exact(&mut buf);
let r = reader.read_exact(&mut buf);
match r {
Ok(_) => {
modification.put_rel_page_image(rel, blknum, Bytes::copy_from_slice(&buf))?;
@@ -204,7 +145,9 @@ fn import_relfile<R: Repository>(
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
ensure!(blknum == nblocks as u32, "unexpected EOF");
let relative_blknum =
blknum - segno * (1024 * 1024 * 1024 / pg_constants::BLCKSZ as u32);
ensure!(relative_blknum == nblocks as u32, "unexpected EOF");
break;
}
_ => {
@@ -215,96 +158,43 @@ fn import_relfile<R: Repository>(
blknum += 1;
}
// Update relation size
//
// If we process rel segments out of order,
// put_rel_extend will skip the update.
modification.put_rel_extend(rel, blknum)?;
Ok(())
}
/// Import a relmapper (pg_filenode.map) file into the repository
fn import_relmap_file<R: Repository>(
modification: &mut DatadirModification<R>,
spcnode: Oid,
dbnode: Oid,
path: &Path,
) -> Result<()> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing relmap file {}", path.display());
modification.put_relmap_file(spcnode, dbnode, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
/// Import a twophase state file (pg_twophase/<xid>) into the repository
fn import_twophase_file<R: Repository>(
modification: &mut DatadirModification<R>,
xid: TransactionId,
path: &Path,
) -> Result<()> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing non-rel file {}", path.display());
modification.put_twophase_file(xid, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
///
/// Import pg_control file into the repository.
///
/// The control file is imported as is, but we also extract the checkpoint record
/// from it and store it separated.
fn import_control_file<R: Repository>(
modification: &mut DatadirModification<R>,
path: &Path,
) -> Result<ControlFileData> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing control file {}", path.display());
// Import it as ControlFile
modification.put_control_file(Bytes::copy_from_slice(&buffer[..]))?;
// Extract the checkpoint record and import it separately.
let pg_control = ControlFileData::decode(&buffer)?;
let checkpoint_bytes = pg_control.checkPointCopy.encode();
modification.put_checkpoint(checkpoint_bytes)?;
Ok(pg_control)
}
///
/// Import an SLRU segment file
///
fn import_slru_file<R: Repository>(
fn import_slru<R: Repository, Reader: Read>(
modification: &mut DatadirModification<R>,
slru: SlruKind,
path: &Path,
mut reader: Reader,
len: usize,
) -> Result<()> {
trace!("importing slru file {}", path.display());
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let segno = u32::from_str_radix(&path.file_name().unwrap().to_string_lossy(), 16)?;
let filename = &path
.file_name()
.expect("missing slru filename")
.to_string_lossy();
let segno = u32::from_str_radix(filename, 16)?;
let len = file.metadata().unwrap().len();
ensure!(len % pg_constants::BLCKSZ as u64 == 0); // we assume SLRU block size is the same as BLCKSZ
let nblocks = len / pg_constants::BLCKSZ as u64;
ensure!(len % pg_constants::BLCKSZ as usize == 0); // we assume SLRU block size is the same as BLCKSZ
let nblocks = len / pg_constants::BLCKSZ as usize;
ensure!(nblocks <= pg_constants::SLRU_PAGES_PER_SEGMENT as u64);
ensure!(nblocks <= pg_constants::SLRU_PAGES_PER_SEGMENT as usize);
modification.put_slru_segment_creation(slru, segno, nblocks as u32)?;
let mut rpageno = 0;
loop {
let r = file.read_exact(&mut buf);
let r = reader.read_exact(&mut buf);
match r {
Ok(_) => {
modification.put_slru_page_image(
@@ -396,10 +286,272 @@ fn import_wal<R: Repository>(
}
if last_lsn != startpoint {
debug!("reached end of WAL at {}", last_lsn);
info!("reached end of WAL at {}", last_lsn);
} else {
info!("no WAL to import at {}", last_lsn);
}
Ok(())
}
pub fn import_basebackup_from_tar<R: Repository, Reader: Read>(
tline: &mut DatadirTimeline<R>,
reader: Reader,
base_lsn: Lsn,
) -> Result<()> {
info!("importing base at {}", base_lsn);
let mut modification = tline.begin_modification(base_lsn);
modification.init_empty()?;
let mut pg_control: Option<ControlFileData> = None;
// Import base
for base_tar_entry in tar::Archive::new(reader).entries()? {
let entry = base_tar_entry?;
let header = entry.header();
let len = header.entry_size()? as usize;
let file_path = header.path()?.into_owned();
match header.entry_type() {
tar::EntryType::Regular => {
if let Some(res) = import_file(&mut modification, file_path.as_ref(), entry, len)? {
// We found the pg_control file.
pg_control = Some(res);
}
modification.flush()?;
}
tar::EntryType::Directory => {
debug!("directory {:?}", file_path);
}
_ => {
panic!("tar::EntryType::?? {}", file_path.display());
}
}
}
// sanity check: ensure that pg_control is loaded
let _pg_control = pg_control.context("pg_control file not found")?;
modification.commit()?;
Ok(())
}
pub fn import_wal_from_tar<R: Repository, Reader: Read>(
tline: &mut DatadirTimeline<R>,
reader: Reader,
start_lsn: Lsn,
end_lsn: Lsn,
) -> Result<()> {
// Set up walingest mutable state
let mut waldecoder = WalStreamDecoder::new(start_lsn);
let mut segno = start_lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let mut offset = start_lsn.segment_offset(pg_constants::WAL_SEGMENT_SIZE);
let mut last_lsn = start_lsn;
let mut walingest = WalIngest::new(tline, start_lsn)?;
// Ingest wal until end_lsn
info!("importing wal until {}", end_lsn);
let mut pg_wal_tar = tar::Archive::new(reader);
let mut pg_wal_entries_iter = pg_wal_tar.entries()?;
while last_lsn <= end_lsn {
let bytes = {
let entry = pg_wal_entries_iter.next().expect("expected more wal")?;
let header = entry.header();
let file_path = header.path()?.into_owned();
match header.entry_type() {
tar::EntryType::Regular => {
// FIXME: assume postgresql tli 1 for now
let expected_filename = XLogFileName(1, segno, pg_constants::WAL_SEGMENT_SIZE);
let file_name = file_path
.file_name()
.expect("missing wal filename")
.to_string_lossy();
ensure!(expected_filename == file_name);
debug!("processing wal file {:?}", file_path);
read_all_bytes(entry)?
}
tar::EntryType::Directory => {
debug!("directory {:?}", file_path);
continue;
}
_ => {
panic!("tar::EntryType::?? {}", file_path.display());
}
}
};
waldecoder.feed_bytes(&bytes[offset..]);
while last_lsn <= end_lsn {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
walingest.ingest_record(tline, recdata, lsn)?;
last_lsn = lsn;
debug!("imported record at {} (end {})", lsn, end_lsn);
}
}
debug!("imported records up to {}", last_lsn);
segno += 1;
offset = 0;
}
if last_lsn != start_lsn {
info!("reached end of WAL at {}", last_lsn);
} else {
info!("there was no WAL to import at {}", last_lsn);
}
// Log any extra unused files
for e in &mut pg_wal_entries_iter {
let entry = e?;
let header = entry.header();
let file_path = header.path()?.into_owned();
info!("skipping {:?}", file_path);
}
Ok(())
}
pub fn import_file<R: Repository, Reader: Read>(
modification: &mut DatadirModification<R>,
file_path: &Path,
reader: Reader,
len: usize,
) -> Result<Option<ControlFileData>> {
debug!("looking at {:?}", file_path);
if file_path.starts_with("global") {
let spcnode = pg_constants::GLOBALTABLESPACE_OID;
let dbnode = 0;
match file_path
.file_name()
.expect("missing filename")
.to_string_lossy()
.as_ref()
{
"pg_control" => {
let bytes = read_all_bytes(reader)?;
// Extract the checkpoint record and import it separately.
let pg_control = ControlFileData::decode(&bytes[..])?;
let checkpoint_bytes = pg_control.checkPointCopy.encode()?;
modification.put_checkpoint(checkpoint_bytes)?;
debug!("imported control file");
// Import it as ControlFile
modification.put_control_file(bytes)?;
return Ok(Some(pg_control));
}
"pg_filenode.map" => {
let bytes = read_all_bytes(reader)?;
modification.put_relmap_file(spcnode, dbnode, bytes)?;
debug!("imported relmap file")
}
"PG_VERSION" => {
debug!("ignored");
}
_ => {
import_rel(modification, file_path, spcnode, dbnode, reader, len)?;
debug!("imported rel creation");
}
}
} else if file_path.starts_with("base") {
let spcnode = pg_constants::DEFAULTTABLESPACE_OID;
let dbnode: u32 = file_path
.iter()
.nth(1)
.expect("invalid file path, expected dbnode")
.to_string_lossy()
.parse()?;
match file_path
.file_name()
.expect("missing base filename")
.to_string_lossy()
.as_ref()
{
"pg_filenode.map" => {
let bytes = read_all_bytes(reader)?;
modification.put_relmap_file(spcnode, dbnode, bytes)?;
debug!("imported relmap file")
}
"PG_VERSION" => {
debug!("ignored");
}
_ => {
import_rel(modification, file_path, spcnode, dbnode, reader, len)?;
debug!("imported rel creation");
}
}
} else if file_path.starts_with("pg_xact") {
let slru = SlruKind::Clog;
import_slru(modification, slru, file_path, reader, len)?;
debug!("imported clog slru");
} else if file_path.starts_with("pg_multixact/offsets") {
let slru = SlruKind::MultiXactOffsets;
import_slru(modification, slru, file_path, reader, len)?;
debug!("imported multixact offsets slru");
} else if file_path.starts_with("pg_multixact/members") {
let slru = SlruKind::MultiXactMembers;
import_slru(modification, slru, file_path, reader, len)?;
debug!("imported multixact members slru");
} else if file_path.starts_with("pg_twophase") {
let file_name = &file_path
.file_name()
.expect("missing twophase filename")
.to_string_lossy();
let xid = u32::from_str_radix(file_name, 16)?;
let bytes = read_all_bytes(reader)?;
modification.put_twophase_file(xid, Bytes::copy_from_slice(&bytes[..]))?;
debug!("imported twophase file");
} else if file_path.starts_with("pg_wal") {
debug!("found wal file in base section. ignore it");
} else if file_path.starts_with("zenith.signal") {
// Parse zenith signal file to set correct previous LSN
let bytes = read_all_bytes(reader)?;
// zenith.signal format is "PREV LSN: prev_lsn"
// TODO write serialization and deserialization in the same place.
let zenith_signal = std::str::from_utf8(&bytes)?.trim();
let prev_lsn = match zenith_signal {
"PREV LSN: none" => Lsn(0),
"PREV LSN: invalid" => Lsn(0),
other => {
let split = other.split(':').collect::<Vec<_>>();
split[1]
.trim()
.parse::<Lsn>()
.context("can't parse zenith.signal")?
}
};
// zenith.signal is not necessarily the last file, that we handle
// but it is ok to call `finish_write()`, because final `modification.commit()`
// will update lsn once more to the final one.
let writer = modification.tline.tline.writer();
writer.finish_write(prev_lsn);
debug!("imported zenith signal {}", prev_lsn);
} else if file_path.starts_with("pg_tblspc") {
// TODO Backups exported from neon won't have pg_tblspc, but we will need
// this to import arbitrary postgres databases.
bail!("Importing pg_tblspc is not implemented");
} else {
debug!("ignored");
}
Ok(None)
}
fn read_all_bytes<Reader: Read>(mut reader: Reader) -> Result<Bytes> {
let mut buf: Vec<u8> = vec![];
reader.read_to_end(&mut buf)?;
Ok(Bytes::copy_from_slice(&buf[..]))
}

File diff suppressed because it is too large Load Diff

View File

@@ -23,6 +23,7 @@ distribution depends on the workload: the updates could be totally random, or
there could be a long stream of updates to a single relation when data is bulk
loaded, for example, or something in between.
```
Cloud Storage Page Server Safekeeper
L1 L0 Memory WAL
@@ -37,6 +38,7 @@ Cloud Storage Page Server Safekeeper
+----+----+ +----+----+ | | |
|EEEE| |EEEE|EEEE| +---+-----+
+----+ +----+----+
```
In this illustration, WAL is received as a stream from the Safekeeper, from the
right. It is immediately captured by the page server and stored quickly in
@@ -47,7 +49,7 @@ the same page and relation close to each other.
From the page server memory, whenever enough WAL has been accumulated, it is flushed
to disk into a new L0 layer file, and the memory is released.
When enough L0 files have been accumulated, they are merged together rand sliced
When enough L0 files have been accumulated, they are merged together and sliced
per key-space, producing a new set of files where each file contains a more
narrow key range, but larger LSN range.
@@ -121,7 +123,7 @@ The files are called "layer files". Each layer file covers a range of keys, and
a range of LSNs (or a single LSN, in case of image layers). You can think of it
as a rectangle in the two-dimensional key-LSN space. The layer files for each
timeline are stored in the timeline's subdirectory under
.zenith/tenants/<tenantid>/timelines.
`.zenith/tenants/<tenantid>/timelines`.
There are two kind of layer files: images, and delta layers. An image file
contains a snapshot of all keys at a particular LSN, whereas a delta file
@@ -130,8 +132,11 @@ range of LSN.
image file:
```
000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568
start key end key LSN
```
The first parts define the key range that the layer covers. See
pgdatadir_mapping.rs for how the key space is used. The last part is the LSN.
@@ -140,8 +145,10 @@ delta file:
Delta files are named similarly, but they cover a range of LSNs:
```
000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051
start key end key start LSN end LSN
```
A delta file contains all the key-values in the key-range that were updated in
the LSN range. If a key has not been modified, there is no trace of it in the
@@ -151,7 +158,9 @@ delta layer.
A delta layer file can cover a part of the overall key space, as in the previous
example, or the whole key range like this:
```
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000578C6B29-0000000057A50051
```
A file that covers the whole key range is called a L0 file (Level 0), while a
file that covers only part of the key range is called a L1 file. The "level" of
@@ -168,7 +177,9 @@ version, and how branching and GC works is still valid.
The full path of a delta file looks like this:
```
.zenith/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000
```
For simplicity, the examples below use a simplified notation for the
paths. The tenant ID is left out, the timeline ID is replaced with
@@ -177,8 +188,10 @@ with a human-readable table name. The LSNs are also shorter. For
example, a base image file at LSN 100 and a delta file between 100-200
for 'orders' table on 'main' branch is represented like this:
```
main/orders_100
main/orders_100_200
```
# Creating layer files
@@ -188,12 +201,14 @@ branch called 'main' and two tables, 'orders' and 'customers'. The end
of WAL is currently at LSN 250. In this starting situation, you would
have these files on disk:
```
main/orders_100
main/orders_100_200
main/orders_200
main/customers_100
main/customers_100_200
main/customers_200
```
In addition to those files, the recent changes between LSN 200 and the
end of WAL at 250 are kept in memory. If the page server crashes, the
@@ -224,6 +239,7 @@ If the customers table is modified later, a new file is created for it
at the next checkpoint. The new file will cover the "gap" from the
last layer file, so the LSN ranges are always contiguous:
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -236,6 +252,7 @@ last layer file, so the LSN ranges are always contiguous:
main/customers_200
main/customers_200_500
main/customers_500
```
## Reading page versions
@@ -259,15 +276,18 @@ involves replaying any WAL records applicable to the page between LSNs
Imagine that a child branch is created at LSN 250:
```
@250
----main--+-------------------------->
\
+---child-------------->
```
Then, the 'orders' table is updated differently on the 'main' and
'child' branches. You now have this situation on disk:
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -282,6 +302,7 @@ Then, the 'orders' table is updated differently on the 'main' and
child/orders_300
child/orders_300_400
child/orders_400
```
Because the 'customers' table hasn't been modified on the child
branch, there is no file for it there. If you request a page for it on
@@ -294,6 +315,7 @@ is linear, and the request's LSN identifies unambiguously which file
you need to look at. For example, the history for the 'orders' table
on the 'main' branch consists of these files:
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -301,10 +323,12 @@ on the 'main' branch consists of these files:
main/orders_300
main/orders_300_400
main/orders_400
```
And from the 'child' branch's point of view, it consists of these
files:
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -313,6 +337,7 @@ files:
child/orders_300
child/orders_300_400
child/orders_400
```
The branch metadata includes the point where the child branch was
created, LSN 250. If a page request comes with LSN 275, we read the
@@ -345,6 +370,7 @@ Let's look at the single branch scenario again. Imagine that the end
of the branch is LSN 525, so that the GC horizon is currently at
525-150 = 375
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -357,11 +383,13 @@ of the branch is LSN 525, so that the GC horizon is currently at
main/customers_100
main/customers_100_200
main/customers_200
```
We can remove the following files because the end LSNs of those files are
older than GC horizon 375, and there are more recent layer files for the
table:
```
main/orders_100 DELETE
main/orders_100_200 DELETE
main/orders_200 DELETE
@@ -374,8 +402,9 @@ table:
main/customers_100 DELETE
main/customers_100_200 DELETE
main/customers_200 KEEP, NO NEWER VERSION
```
'main/customers_100_200' is old enough, but it cannot be
'main/customers_200' is old enough, but it cannot be
removed because there is no newer layer file for the table.
Things get slightly more complicated with multiple branches. All of
@@ -384,6 +413,7 @@ retain older shapshot files that are still needed by child branches.
For example, if child branch is created at LSN 150, and the 'customers'
table is updated on the branch, you would have these files:
```
main/orders_100 KEEP, NEEDED BY child BRANCH
main/orders_100_200 KEEP, NEEDED BY child BRANCH
main/orders_200 DELETE
@@ -398,6 +428,7 @@ table is updated on the branch, you would have these files:
main/customers_200 KEEP, NO NEWER VERSION
child/customers_150_300 DELETE
child/customers_300 KEEP, NO NEWER VERSION
```
In this situation, 'main/orders_100' and 'main/orders_100_200' cannot
be removed, even though they are older than the GC horizon, because
@@ -407,6 +438,7 @@ and 'main/orders_200_300' can still be removed.
If 'orders' is modified later on the 'child' branch, we will create a
new base image and delta file for it on the child:
```
main/orders_100
main/orders_100_200
@@ -419,6 +451,7 @@ new base image and delta file for it on the child:
child/customers_300
child/orders_150_400
child/orders_400
```
After this, the 'main/orders_100' and 'main/orders_100_200' file could
be removed. It is no longer needed by the child branch, because there
@@ -434,6 +467,7 @@ Describe GC and checkpoint interval settings.
In principle, each relation can be checkpointed separately, i.e. the
LSN ranges of the files don't need to line up. So this would be legal:
```
main/orders_100
main/orders_100_200
main/orders_200
@@ -446,6 +480,7 @@ LSN ranges of the files don't need to line up. So this would be legal:
main/customers_250
main/customers_250_500
main/customers_500
```
However, the code currently always checkpoints all relations together.
So that situation doesn't arise in practice.
@@ -468,11 +503,13 @@ does that. It could be useful, however, as a transient state when
garbage collecting around branch points, or explicit recovery
points. For example, if we start with this:
```
main/orders_100
main/orders_100_200
main/orders_200
main/orders_200_300
main/orders_300
```
And there is a branch or explicit recovery point at LSN 150, we could
replace 'main/orders_100_200' with 'main/orders_150' to keep a

View File

@@ -34,7 +34,7 @@ pub trait BlobCursor {
) -> Result<(), std::io::Error>;
}
impl<'a, R> BlobCursor for BlockCursor<R>
impl<R> BlobCursor for BlockCursor<R>
where
R: BlockReader,
{

View File

@@ -37,11 +37,8 @@ use crate::virtual_file::VirtualFile;
use crate::walrecord;
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use tracing::*;
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::fs;
use std::io::{BufWriter, Write};
use std::io::{Seek, SeekFrom};
@@ -49,6 +46,7 @@ use std::ops::Range;
use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use tracing::*;
use utils::{
bin_ser::BeSer,
@@ -218,6 +216,10 @@ impl Layer for DeltaLayer {
PathBuf::from(self.layer_name().to_string())
}
fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
fn get_value_reconstruct_data(
&self,
key: Key,
@@ -250,6 +252,9 @@ impl Layer for DeltaLayer {
return false;
}
let entry_lsn = DeltaKey::extract_lsn_from_buf(key);
if entry_lsn < lsn_range.start {
return false;
}
offsets.push((entry_lsn, blob_ref.pos()));
!blob_ref.will_init()
@@ -258,8 +263,18 @@ impl Layer for DeltaLayer {
// Ok, 'offsets' now contains the offsets of all the entries we need to read
let mut cursor = file.block_cursor();
for (entry_lsn, pos) in offsets {
let buf = cursor.read_blob(pos)?;
let val = Value::des(&buf)?;
let buf = cursor.read_blob(pos).with_context(|| {
format!(
"Failed to read blob from virtual file {}",
file.file.path.display()
)
})?;
let val = Value::des(&buf).with_context(|| {
format!(
"Failed to deserialize file blob from virtual file {}",
file.file.path.display()
)
})?;
match val {
Value::Image(img) => {
reconstruct_state.img = Some((entry_lsn, img));
@@ -348,6 +363,28 @@ impl Layer for DeltaLayer {
tree_reader.dump()?;
let mut cursor = file.block_cursor();
// A subroutine to dump a single blob
let mut dump_blob = |blob_ref: BlobRef| -> anyhow::Result<String> {
let buf = cursor.read_blob(blob_ref.pos())?;
let val = Value::des(&buf)?;
let desc = match val {
Value::Image(img) => {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
}
};
Ok(desc)
};
tree_reader.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
@@ -356,34 +393,10 @@ impl Layer for DeltaLayer {
let key = DeltaKey::extract_key_from_buf(delta_key);
let lsn = DeltaKey::extract_lsn_from_buf(delta_key);
let mut desc = String::new();
match cursor.read_blob(blob_ref.pos()) {
Ok(buf) => {
let val = Value::des(&buf);
match val {
Ok(Value::Image(img)) => {
write!(&mut desc, " img {} bytes", img.len()).unwrap();
}
Ok(Value::WalRecord(rec)) => {
let wal_desc = walrecord::describe_wal_record(&rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
buf.len(),
rec.will_init(),
wal_desc
)
.unwrap();
}
Err(err) => {
write!(&mut desc, " DESERIALIZATION ERROR: {}", err).unwrap();
}
}
}
Err(err) => {
write!(&mut desc, " READ ERROR: {}", err).unwrap();
}
}
let desc = match dump_blob(blob_ref) {
Ok(desc) => desc,
Err(err) => format!("ERROR: {}", err),
};
println!(" key {} at {}: {}", key, lsn, desc);
true
},
@@ -408,6 +421,28 @@ impl DeltaLayer {
}
}
fn temp_path_for(
conf: &PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
key_start: Key,
lsn_range: &Range<Lsn>,
) -> PathBuf {
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(&timelineid, &tenantid).join(format!(
"{}-XXX__{:016X}-{:016X}.{}.temp",
key_start,
u64::from(lsn_range.start),
u64::from(lsn_range.end),
rand_string
))
}
///
/// Open the underlying file and read the metadata into memory, if it's
/// not loaded already.
@@ -595,12 +630,8 @@ impl DeltaLayerWriter {
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = conf.timeline_path(&timelineid, &tenantid).join(format!(
"{}-XXX__{:016X}-{:016X}.temp",
key_start,
u64::from(lsn_range.start),
u64::from(lsn_range.end)
));
let path = DeltaLayer::temp_path_for(conf, timelineid, tenantid, key_start, &lsn_range);
let mut file = VirtualFile::create(&path)?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64))?;
@@ -693,6 +724,8 @@ impl DeltaLayerWriter {
}),
};
// fsync the file
file.sync_all()?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.

View File

@@ -11,7 +11,6 @@
//! - page-oriented
//!
//! TODO:
//! - better errors (e.g. with thiserror?)
//! - maybe something like an Adaptive Radix Tree would be more efficient?
//! - the values stored by image and delta layers are offsets into the file,
//! and they are in monotonically increasing order. Prefix compression would
@@ -19,11 +18,12 @@
//! - An Iterator interface would be more convenient for the callers than the
//! 'visit' function
//!
use anyhow;
use byteorder::{ReadBytesExt, BE};
use bytes::{BufMut, Bytes, BytesMut};
use hex;
use std::cmp::Ordering;
use std::{cmp::Ordering, io, result};
use thiserror::Error;
use tracing::error;
use crate::layered_repository::block_io::{BlockReader, BlockWriter};
@@ -86,6 +86,23 @@ impl Value {
}
}
#[derive(Error, Debug)]
pub enum DiskBtreeError {
#[error("Attempt to append a value that is too large {0} > {}", MAX_VALUE)]
AppendOverflow(u64),
#[error("Unsorted input: key {key:?} is <= last_key {last_key:?}")]
UnsortedInput { key: Box<[u8]>, last_key: Box<[u8]> },
#[error("Could not push to new leaf node")]
FailedToPushToNewLeafNode,
#[error("IoError: {0}")]
Io(#[from] io::Error),
}
pub type Result<T> = result::Result<T, DiskBtreeError>;
/// This is the on-disk representation.
struct OnDiskNode<'a, const L: usize> {
// Fixed-width fields
@@ -106,12 +123,12 @@ impl<'a, const L: usize> OnDiskNode<'a, L> {
///
/// Interpret a PAGE_SZ page as a node.
///
fn deparse(buf: &[u8]) -> OnDiskNode<L> {
fn deparse(buf: &[u8]) -> Result<OnDiskNode<L>> {
let mut cursor = std::io::Cursor::new(buf);
let num_children = cursor.read_u16::<BE>().unwrap();
let level = cursor.read_u8().unwrap();
let prefix_len = cursor.read_u8().unwrap();
let suffix_len = cursor.read_u8().unwrap();
let num_children = cursor.read_u16::<BE>()?;
let level = cursor.read_u8()?;
let prefix_len = cursor.read_u8()?;
let suffix_len = cursor.read_u8()?;
let mut off = cursor.position();
let prefix_off = off as usize;
@@ -129,7 +146,7 @@ impl<'a, const L: usize> OnDiskNode<'a, L> {
let keys = &buf[keys_off..keys_off + keys_len];
let values = &buf[values_off..values_off + values_len];
OnDiskNode {
Ok(OnDiskNode {
num_children,
level,
prefix_len,
@@ -137,7 +154,7 @@ impl<'a, const L: usize> OnDiskNode<'a, L> {
prefix,
keys,
values,
}
})
}
///
@@ -149,7 +166,11 @@ impl<'a, const L: usize> OnDiskNode<'a, L> {
Value::from_slice(value_slice)
}
fn binary_search(&self, search_key: &[u8; L], keybuf: &mut [u8]) -> Result<usize, usize> {
fn binary_search(
&self,
search_key: &[u8; L],
keybuf: &mut [u8],
) -> result::Result<usize, usize> {
let mut size = self.num_children as usize;
let mut low = 0;
let mut high = size;
@@ -209,7 +230,7 @@ where
///
/// Read the value for given key. Returns the value, or None if it doesn't exist.
///
pub fn get(&self, search_key: &[u8; L]) -> anyhow::Result<Option<u64>> {
pub fn get(&self, search_key: &[u8; L]) -> Result<Option<u64>> {
let mut result: Option<u64> = None;
self.visit(search_key, VisitDirection::Forwards, |key, value| {
if key == search_key {
@@ -230,7 +251,7 @@ where
search_key: &[u8; L],
dir: VisitDirection,
mut visitor: V,
) -> anyhow::Result<bool>
) -> Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
@@ -243,7 +264,7 @@ where
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> anyhow::Result<bool>
) -> Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
@@ -260,11 +281,11 @@ where
search_key: &[u8; L],
dir: VisitDirection,
visitor: &mut V,
) -> anyhow::Result<bool>
) -> Result<bool>
where
V: FnMut(&[u8], u64) -> bool,
{
let node = OnDiskNode::deparse(node_buf);
let node = OnDiskNode::deparse(node_buf)?;
let prefix_len = node.prefix_len as usize;
let suffix_len = node.suffix_len as usize;
@@ -369,15 +390,15 @@ where
}
#[allow(dead_code)]
pub fn dump(&self) -> anyhow::Result<()> {
pub fn dump(&self) -> Result<()> {
self.dump_recurse(self.root_blk, &[], 0)
}
fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> anyhow::Result<()> {
fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> Result<()> {
let blk = self.reader.read_blk(self.start_blk + blknum)?;
let buf: &[u8] = blk.as_ref();
let node = OnDiskNode::<L>::deparse(buf);
let node = OnDiskNode::<L>::deparse(buf)?;
print!("{:indent$}", "", indent = depth * 2);
println!(
@@ -423,6 +444,13 @@ where
///
/// stack[0] is the current root page, stack.last() is the leaf.
///
/// We maintain the length of the stack to be always greater than zero.
/// Two exceptions are:
/// 1. `Self::flush_node`. The method will push the new node if it extracted the last one.
/// So because other methods cannot see the intermediate state invariant still holds.
/// 2. `Self::finish`. It consumes self and does not return it back,
/// which means that this is where the structure is destroyed.
/// Thus stack of zero length cannot be observed by other methods.
stack: Vec<BuildNode<L>>,
/// Last key that was appended to the tree. Used to sanity check that append
@@ -442,19 +470,29 @@ where
}
}
pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<(), anyhow::Error> {
assert!(value <= MAX_VALUE);
pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<()> {
if value > MAX_VALUE {
return Err(DiskBtreeError::AppendOverflow(value));
}
if let Some(last_key) = &self.last_key {
assert!(key > last_key, "unsorted input");
if key <= last_key {
return Err(DiskBtreeError::UnsortedInput {
key: key.as_slice().into(),
last_key: last_key.as_slice().into(),
});
}
}
self.last_key = Some(*key);
Ok(self.append_internal(key, Value::from_u64(value))?)
self.append_internal(key, Value::from_u64(value))
}
fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<(), std::io::Error> {
fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<()> {
// Try to append to the current leaf buffer
let last = self.stack.last_mut().unwrap();
let last = self
.stack
.last_mut()
.expect("should always have at least one item");
let level = last.level;
if last.push(key, value) {
return Ok(());
@@ -476,26 +514,33 @@ where
// key to it.
let mut last = BuildNode::new(level);
if !last.push(key, value) {
panic!("could not push to new leaf node");
return Err(DiskBtreeError::FailedToPushToNewLeafNode);
}
self.stack.push(last);
Ok(())
}
fn flush_node(&mut self) -> Result<(), std::io::Error> {
let last = self.stack.pop().unwrap();
/// Flush the bottommost node in the stack to disk. Appends a downlink to its parent,
/// and recursively flushes the parent too, if it becomes full. If the root page becomes full,
/// creates a new root page, increasing the height of the tree.
fn flush_node(&mut self) -> Result<()> {
// Get the current bottommost node in the stack and flush it to disk.
let last = self
.stack
.pop()
.expect("should always have at least one item");
let buf = last.pack();
let downlink_key = last.first_key();
let downlink_ptr = self.writer.write_blk(buf)?;
// Append the downlink to the parent
// Append the downlink to the parent. If there is no parent, ie. this was the root page,
// create a new root page, increasing the height of the tree.
if self.stack.is_empty() {
self.stack.push(BuildNode::new(last.level + 1));
}
self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))?;
Ok(())
self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))
}
///
@@ -505,13 +550,16 @@ where
/// (In the image and delta layers, it is stored in the beginning of the file,
/// in the summary header)
///
pub fn finish(mut self) -> Result<(u32, W), std::io::Error> {
pub fn finish(mut self) -> Result<(u32, W)> {
// flush all levels, except the root.
while self.stack.len() > 1 {
self.flush_node()?;
}
let root = self.stack.first().unwrap();
let root = self
.stack
.first()
.expect("by the check above we left one item there");
let buf = root.pack();
let root_blknum = self.writer.write_blk(buf)?;
@@ -692,14 +740,14 @@ mod tests {
impl BlockReader for TestDisk {
type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>;
fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
fn read_blk(&self, blknum: u32) -> io::Result<Self::BlockLease> {
let mut buf = [0u8; PAGE_SZ];
buf.copy_from_slice(&self.blocks[blknum as usize]);
Ok(std::rc::Rc::new(buf))
}
}
impl BlockWriter for &mut TestDisk {
fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error> {
fn write_blk(&mut self, buf: Bytes) -> io::Result<u32> {
let blknum = self.blocks.len();
self.blocks.push(buf);
Ok(blknum as u32)
@@ -707,7 +755,7 @@ mod tests {
}
#[test]
fn basic() -> anyhow::Result<()> {
fn basic() -> Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk);
@@ -788,7 +836,7 @@ mod tests {
}
#[test]
fn lots_of_keys() -> anyhow::Result<()> {
fn lots_of_keys() -> Result<()> {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk);
@@ -882,7 +930,7 @@ mod tests {
}
#[test]
fn random_data() -> anyhow::Result<()> {
fn random_data() -> Result<()> {
// Generate random keys with exponential distribution, to
// exercise the prefix compression
const NUM_KEYS: usize = 100000;
@@ -927,21 +975,27 @@ mod tests {
}
#[test]
#[should_panic(expected = "unsorted input")]
fn unsorted_input() {
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 2>::new(&mut disk);
let _ = writer.append(b"ba", 1);
let _ = writer.append(b"bb", 2);
let _ = writer.append(b"aa", 3);
let err = writer.append(b"aa", 3).expect_err("should've failed");
match err {
DiskBtreeError::UnsortedInput { key, last_key } => {
assert_eq!(key.as_ref(), b"aa".as_slice());
assert_eq!(last_key.as_ref(), b"bb".as_slice());
}
_ => panic!("unexpected error variant, expected DiskBtreeError::UnsortedInput"),
}
}
///
/// This test contains a particular data set, see disk_btree_test_data.rs
///
#[test]
fn particular_data() -> anyhow::Result<()> {
fn particular_data() -> Result<()> {
// Build a tree from it
let mut disk = TestDisk::new();
let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk);

View File

@@ -34,6 +34,7 @@ use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use hex;
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs;
use std::io::Write;
@@ -125,6 +126,10 @@ impl Layer for ImageLayer {
PathBuf::from(self.layer_name().to_string())
}
fn local_path(&self) -> Option<PathBuf> {
Some(self.path())
}
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
@@ -237,6 +242,22 @@ impl ImageLayer {
}
}
fn temp_path_for(
conf: &PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
fname: &ImageFileName,
) -> PathBuf {
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(&timelineid, &tenantid)
.join(format!("{}.{}.temp", fname, rand_string))
}
///
/// Open the underlying file and read the metadata into memory, if it's
/// not loaded already.
@@ -394,7 +415,7 @@ impl ImageLayer {
///
pub struct ImageLayerWriter {
conf: &'static PageServerConf,
_path: PathBuf,
path: PathBuf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
key_range: Range<Key>,
@@ -412,12 +433,10 @@ impl ImageLayerWriter {
key_range: &Range<Key>,
lsn: Lsn,
) -> anyhow::Result<ImageLayerWriter> {
// Create the file
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = ImageLayer::path_for(
&PathOrConf::Conf(conf),
// Create the file initially with a temporary filename.
// We'll atomically rename it to the final name when we're done.
let path = ImageLayer::temp_path_for(
conf,
timelineid,
tenantid,
&ImageFileName {
@@ -437,7 +456,7 @@ impl ImageLayerWriter {
let writer = ImageLayerWriter {
conf,
_path: path,
path,
timelineid,
tenantid,
key_range: key_range.clone(),
@@ -508,6 +527,25 @@ impl ImageLayerWriter {
index_root_blk,
}),
};
// fsync the file
file.sync_all()?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let final_path = ImageLayer::path_for(
&PathOrConf::Conf(self.conf),
self.timelineid,
self.tenantid,
&ImageFileName {
key_range: self.key_range.clone(),
lsn: self.lsn,
},
);
std::fs::rename(self.path, &final_path)?;
trace!("created image layer {}", layer.path().display());
Ok(layer)

Some files were not shown because too many files have changed in this diff Show More