Compare commits

..

334 Commits

Author SHA1 Message Date
Konstantin Knizhnik
24b1b412ee Launch multiple wal-redo postgres instances 2021-12-08 13:02:57 +03:00
Konstantin Knizhnik
0ce4dca05a Disable relish upload and backpressure 2021-12-08 10:30:45 +03:00
Konstantin Knizhnik
1530143e00 Merge branch 'main' into main_local 2021-12-06 17:22:20 +03:00
Konstantin Knizhnik
f77c9c987f Use different default values 2021-12-06 17:21:48 +03:00
Dmitry Ivanov
0a8c672630 [CI] Fix benchmarks
Too bad we don't have a --dry-run in PRs :(
2021-12-06 13:52:28 +03:00
Dmitry Ivanov
b87ab17d05 Bump rust version to 1.56.1
Apparently, code coverage doesn't work that well in 1.55.
2021-12-06 13:27:52 +03:00
Dmitry Ivanov
d874675955 Collect coverage in CI 2021-12-06 13:27:52 +03:00
Dmitry Ivanov
5d37560308 Add bespoke glue script leveraging LLVM coverage tools 2021-12-06 13:27:52 +03:00
Dmitry Ivanov
7cec13d1df Improve shutdown story for code coverage
This patch introduces fixes for several problems affecting
LLVM-based code coverage:

* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.

* Implement proper shutdown handlers in safekeeper.
2021-12-06 13:27:52 +03:00
anastasia
b7685eb6ba Enable backpressure 2021-12-06 12:49:42 +03:00
anastasia
c7f3b4e62c Clarify the meaning of StandbyReply LSNs:
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At the pageserver it doesn't guarantee data persistence, but that's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
2021-12-06 12:49:42 +03:00
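
As a sketch of the semantics above, the three fields could be documented like this (the struct and field names are illustrative, not the actual protocol definition; LSNs shown as plain u64):

    /// Illustrative only: how the pageserver fills the StandbyReply fields.
    struct StandbyReplySketch {
        /// Last LSN received and processed by the pageserver's walreceiver.
        write_lsn: u64,
        /// Same as write_lsn: the pageserver does not guarantee persistence
        /// here; durability is delegated to the safekeepers.
        flush_lsn: u64,
        /// LSN up to which the pageserver guarantees persistence of all
        /// received data (disk_consistent_lsn).
        apply_lsn: u64,
    }
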
Heikki Linnakangas
5bad2deff8 Don't hold 'timelines' lock over checkpoint.
It was very noticeable that while the checkpointer was busy, you
could not e.g. open a new connection.
2021-12-03 07:42:10 -05:00
Arseny Sher
d39608c367 Fix passing start_offset to find_end_of_wal_segment. 2021-12-03 12:43:57 +03:00
Arseny Sher
cba4da3f4d Add term history to safekeepers.
Persist the full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This makes it easy to
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).

The full history of the proposer is transferred in a separate message before the proposer
starts streaming; it is immediately persisted by the safekeeper, though it might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.

We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.

Per https://github.com/zenithdb/rfcs/pull/12

Closes #296.

Bumps vendor/postgres.
2021-12-03 12:43:57 +03:00
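
A rough sketch of why a full term history makes finding the divergence point easy (struct and function names are assumptions for illustration; the real safekeeper logic also has to account for locally available WAL, as noted above):

    /// One entry of the term history: a term and the LSN at which it started.
    #[derive(PartialEq)]
    struct TermSwitch {
        term: u64,
        start_lsn: u64,
    }

    /// Walk two histories in parallel; the first mismatching entry marks the
    /// point past which the older log is obsolete and must be truncated.
    fn divergence_point(ours: &[TermSwitch], theirs: &[TermSwitch]) -> Option<u64> {
        for (a, b) in ours.iter().zip(theirs.iter()) {
            if a != b {
                return Some(a.start_lsn.min(b.start_lsn));
            }
        }
        None // one history is a prefix of the other: nothing to truncate yet
    }
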
Dmitry Rodionov
2669d140f8 use full commit sha for version info
for builds in docker this is not needed, since the environment variable
with the commit sha already contains the full version
2021-12-01 17:35:57 +03:00
Heikki Linnakangas
f49ad33f1b Initialize 'loaded' correctly in DeltaLayer.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it. Alternatively, perhaps we should close it
on unload() to save some memory, but I'm not going to think too
hard about that right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.

This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
2021-11-30 22:23:59 +02:00
Kirill Bulatov
670205e17a Evict excessively failing sync tasks, improve processing for the rest of
the tasks
2021-11-30 13:58:49 +02:00
Konstantin Knizhnik
f72d4814b1 Extract page images from FPI WAL records (#949)
* Extract page images from FPI WAL records

* Fix issues reported in review
2021-11-30 12:57:26 +03:00
Heikki Linnakangas
5ecf0664cc Fix off-by-one error in check for future delta layers.
This doesn't show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
2021-11-29 14:35:24 +02:00
Heikki Linnakangas
7cae265447 Fix dump_layerfile.
The VirtualFile machinery panics if it's not initialized
2021-11-29 11:26:54 +02:00
Heikki Linnakangas
5aa969a588 Replace in-memory layers and OOM-triggered eviction with temp files.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.

To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.

This includes changes from the 'inmemory-layer-chunks' branch to serialize
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
2021-11-26 17:25:17 +03:00
Arthur Petukhovsky
93cc40584d Shutdown socket on CopyFail (#938)
Fixes #935
2021-11-26 16:48:27 +03:00
Dmitry Rodionov
130184fee9 Prohibit branch creation and basebackup at out of scope lsns
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.

To get there, there were also two cleanups:
* Fix error handling in the Execute message handler. This fixes behaviour
  when basebackup returned an error. Previously the pageserver thread just
  died.
* Remove the "ancestor" file which previously contained the ancestor id and
  branch lsn. Currently the same data can be obtained from the metadata file.
  The way we handled the ancestor file in the code also introduced a case
  where, when branching fails, the timeline directory is created but contains
  no data in it except the ancestor file. This confused gc because it scans
  directories. So it is better to just remove the ancestor file and move
  timeline directory creation so it happens after all validity
  checks have passed.
2021-11-25 15:27:16 +03:00
Heikki Linnakangas
d47f610606 Fix pageserver CLI parameter names and document them 2021-11-25 13:31:52 +02:00
Dmitry Rodionov
0650e51f0b add test one more case for layer visibility 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
737a557f09 add check to python tests that after gc the number of rows is unchanged in all branches 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
6f7ebe6e01 preserve data in parent branch that might be referenced in child branch 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
70ab0d5b1f add missing script 2021-11-19 00:10:40 +03:00
Dmitry Rodionov
6ac76248cf Save performance test results from performance test suite runs.
Also render reports for both staging and local runs.
2021-11-19 00:00:19 +03:00
Kirill Bulatov
b32da3b42e Use a less pageserver-specific method in the RemoteStorage trait 2021-11-18 22:53:40 +02:00
Dmitry Ivanov
0ccfc62e88 [proxy] Pass PostgreSQL version to client
Fixes #779
2021-11-17 16:28:44 +03:00
Dmitry Ivanov
b55cf773a8 [proxy] Streamline control- and dataflow 2021-11-17 16:28:44 +03:00
Dmitry Ivanov
43ded1c54b [proxy] Minor cleanup 2021-11-17 16:28:44 +03:00
Heikki Linnakangas
f8702d4625 Fix checking for whether segment exists on a frozen in-memory layer.
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.

This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
2021-11-15 20:19:15 +02:00
Dmitry Rodionov
44111e3ba3 Prohibit branch creation at lsn that was already garbage collected.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
This also adds a check to get_page_at_lsn which asserts that the lsn at
which the page is requested was not garbage collected. This check
is currently triggered for readonly nodes which are pinned to a specific
lsn: because they are not tracked in the pageserver, garbage collection
can remove data that might still be referenced. This is a bug and will
be fixed separately.
2021-11-15 20:03:16 +03:00
Patrick Insinger
298bc588f9 pageserver - don't try to GC InMemoryLayers 2021-11-15 09:01:45 -08:00
Heikki Linnakangas
4ba521f53f Add performance test case for parallel COPY TO 2021-11-15 14:49:53 +02:00
Heikki Linnakangas
431d32756b Add a buffer cache, and use it to store materialized pages.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.

One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.

The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.

Co-authored-by: Patrick Insinger <patrick@zenith.tech>
2021-11-12 11:02:12 -08:00
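
A minimal sketch of the clock replacement policy mentioned above (slot layout and names are illustrative, not the pageserver's actual cache):

    /// One slot of an 8 kB page cache managed by the clock algorithm.
    struct Slot {
        usage_count: u8,
        pinned: bool,
    }

    /// Sweep the clock hand looking for a victim: pinned slots are skipped and
    /// recently used slots get a second chance by decrementing their count.
    fn find_victim(slots: &mut [Slot], hand: &mut usize) -> Option<usize> {
        // Bound the sweep so we don't spin forever if every slot is pinned.
        for _ in 0..slots.len() * 2 {
            let i = *hand;
            *hand = (*hand + 1) % slots.len();
            if slots[i].pinned {
                continue;
            }
            if slots[i].usage_count > 0 {
                slots[i].usage_count -= 1;
            } else {
                return Some(i);
            }
        }
        None
    }
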
Heikki Linnakangas
3d172d98a3 Improve layered repo README.
Add an informal overview of how it works.
2021-11-12 19:59:31 +02:00
Heikki Linnakangas
849ac791a6 Bandaid fix for "page not found" errors, when a table is loaded.
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.

It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where we cross a segment boundary.

Fixes https://github.com/zenithdb/zenith/issues/841
2021-11-12 18:47:39 +02:00
Alexey Kondratov
de5e6a15ae Set LD_LIBRARY_PATH in the check_restored_datadir_content() psql call
Otherwise we may use outdated system libpq.
Also print stdout/stderr if basebackup failed in check_restored_datadir_content()
2021-11-12 16:27:43 +03:00
Alexey Kondratov
0d6bf14ecb Use vendor/postgres rebased on top of REL_14_1 2021-11-12 16:27:43 +03:00
Heikki Linnakangas
d1e79c4af3 Fix locking issues in VirtualFile machinery.
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:

1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.

2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.

Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
2021-11-11 20:12:59 +02:00
Kirill Bulatov
abb2ac5246 Better context when erroring 2021-11-11 19:22:05 +02:00
Kirill Bulatov
99dbbe5f18 Allow downloading remote files partially 2021-11-11 18:51:34 +02:00
Arseny Sher
e7ca8ef5a8 Use PG timelineid 1 everywhere.
As changing it doesn't have useful meaning in Zenith.

ref #824
2021-11-11 13:53:39 +03:00
Patrick Insinger
1ce4976e36 pageserver - track size of VecMaps 2021-11-10 11:09:34 -08:00
Heikki Linnakangas
9300107cdf Cache Book objects, use virtual files to avoid running out of fds.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
2021-11-10 17:19:37 +02:00
Arthur Petukhovsky
9aaa02bc9a Fix high CPU usage in walproposer (#860)
* Bump vendor/postgres

* Update time limits for test_restarts_under_load
2021-11-10 17:18:07 +03:00
Arseny Sher
5603259c53 In wal_proposer_recovery, don't wait for outgoing WAL to be committed.
Otherwise we're deadlocking ourselves. Oversight of 33007cc.
2021-11-10 01:38:25 +03:00
Arseny Sher
ce15c62f35 Fix 'send WAL up to' debug logging. 2021-11-10 01:38:25 +03:00
Egor Suvorov
eaff0cd568 Check python for the whole repository and improve docs (#813) 2021-11-09 22:23:29 +03:00
Egor Suvorov
587935ebed Add Safekeeper metrics tests (#746)
* zenith_fixtures.py: add SafekeeperHttpClient.get_metrics()
* Ensure that `collect_lsn` and `flush_lsn`'s reported values look reasonable in `test_many_timelines`
2021-11-09 22:18:59 +03:00
Dmitry Rodionov
07dddfed28 Use more robust way to persist safekeeper control file.
Now the safekeeper control file is updated in the following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).

Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).

Also, because of the rename machinery, switch to using a dedicated lock file
to prevent running several safekeepers concurrently on the same data.

cleanup

fsync control file after rename to match postgres behaviour
2021-11-09 17:51:46 +03:00
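
A minimal sketch of the five steps above (file names and the function signature are assumptions; error handling kept trivial):

    use std::fs::{self, File};
    use std::io::{self, Write};
    use std::path::Path;

    /// Durably replace the control file with `data`, mirroring the steps in
    /// the commit message: write a temp file, fsync it, rename it over the
    /// target, fsync the directory, and fsync the renamed file.
    fn persist_control_file(dir: &Path, data: &[u8], sync: bool) -> io::Result<()> {
        let tmp_path = dir.join("safekeeper.control.partial");
        let control_path = dir.join("safekeeper.control");

        // 1. Write data to a temp file.
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(data)?;
        if sync {
            tmp.sync_all()?; // 2. fsync the temporary file
        }
        // 3. Rename the temporary file to the actual control file.
        fs::rename(&tmp_path, &control_path)?;
        if sync {
            File::open(dir)?.sync_all()?;            // 4. fsync containing directory
            File::open(&control_path)?.sync_all()?;  // 5. fsync file after rename
        }
        Ok(())
    }
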
Arseny Sher
229dc7704f Bump vendor/postgres. 2021-11-08 17:32:13 +03:00
Dmitry Rodionov
067f2ac814 fix perf repo branch name 2021-11-08 13:27:23 +03:00
Dmitry Rodionov
865870a8e5 Follow up staging benchmarking
* change zenith-perf-data checkout ref to be main
* set cluster id through secrets so there is no code changes required
  when we wipe out clusters on staging
* display full pgbench output on error
2021-11-05 14:07:11 +03:00
Arthur Petukhovsky
d19263aec8 Adjust timeouts for test_restarts_under_load (#830)
* Adjust timeouts for test_restarts_under_load

* Add test timeout for test_restarts_under_load
2021-11-04 19:58:40 +03:00
Heikki Linnakangas
6d742719a1 Fix infinite loop in looking up predecessor layer
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.

Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
2021-11-04 16:07:12 +02:00
Dmitry Rodionov
c75bc9b8b0 Change benchmark plugin layout so pytest loads it properly when running
all tests (not just performance ones)

resolves #837
2021-11-04 16:33:31 +03:00
Egor Suvorov
33007cc0bb Safekeeper's START_REPLICATION handler: remove stop_point, do not handle start_point == 0 (#777) 2021-11-04 14:50:33 +03:00
Dmitry Rodionov
987833e0b9 Propagate git SHA to zenith binaries
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
2021-11-04 14:22:29 +03:00
Kirill Bulatov
f36acf00de Reduce "relish" word usages in remote storage 2021-11-04 12:53:42 +02:00
Kirill Bulatov
956fc3dec9 Tidy up and make the remote storage API consistent 2021-11-04 12:53:42 +02:00
Heikki Linnakangas
b38e841f2d Use poll() in communication with WAL redo process.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
2021-11-04 10:39:04 +02:00
Heikki Linnakangas
3a0111c75e Refactor functions for constructing WAL redo messages.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
2021-11-04 10:39:00 +02:00
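
A tiny sketch of the batching idea above; the framing (tag byte plus length) is purely illustrative and not the actual walredo wire format:

    /// Serialize a whole batch of messages into one Vec<u8> instead of one
    /// Vec per message: tag byte, little-endian u32 length, then the payload.
    fn build_messages(records: &[(u8, Vec<u8>)]) -> Vec<u8> {
        let total: usize = records.iter().map(|(_, p)| 1 + 4 + p.len()).sum();
        let mut buf = Vec::with_capacity(total);
        for (tag, payload) in records {
            buf.push(*tag);
            buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
            buf.extend_from_slice(payload);
        }
        buf
    }
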
Heikki Linnakangas
086a02ab92 Add performance test for simple seq scans.
Fixes https://github.com/zenithdb/zenith/issues/831
2021-11-04 10:36:45 +02:00
Heikki Linnakangas
7ed39655dc Bump vendor/postgres 2021-11-04 10:35:50 +02:00
Dmitry Rodionov
c6172dae47 implement performance tests against our staging environment
Tests are based on a self-hosted runner which is physically close
to our staging deployment in aws; currently the tests consist of
various configurations of pgbench runs.

Also these changes rework the benchmark fixture by removing globals and
allowing reports with the desired metrics to be collected and dumped to json
for further analysis. This is also applicable to the usual performance tests
which use local zenith binaries.
2021-11-04 02:15:46 +03:00
Heikki Linnakangas
4ba783d0af Remove a couple of unused functions.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
2021-11-03 19:10:23 +02:00
Patrick Insinger
0457fe81a9 pageserver - make PageVersion an enum 2021-11-03 09:28:49 -07:00
Heikki Linnakangas
fb524dd973 Put a global limit on memory used by in-memory layers.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.

After storing a WAL record in the repository, the WAL receiver checks
the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.

This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.

Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
2021-11-02 15:49:39 +02:00
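
A rough sketch of the global accounting described above (names and the check are illustrative; victim selection by GClock is omitted):

    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Very approximate global accounting of memory held by in-memory layers.
    static GLOBAL_INMEM_BYTES: AtomicUsize = AtomicUsize::new(0);

    /// The commit message mentions a hard-coded 128 MB threshold.
    const GLOBAL_INMEM_LIMIT: usize = 128 * 1024 * 1024;

    /// Called by the WAL receiver after storing a record; returns true when a
    /// victim layer should be evicted (chosen by GClock, not shown here).
    fn account_and_check(added_bytes: usize) -> bool {
        let before = GLOBAL_INMEM_BYTES.fetch_add(added_bytes, Ordering::Relaxed);
        before + added_bytes > GLOBAL_INMEM_LIMIT
    }
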
Heikki Linnakangas
8c6d2664c0 Support removing arbitrary open layers, not just the oldest one 2021-11-02 15:43:16 +02:00
Patrick Insinger
cdbbd15eb9 pageserver - add InMemoryLayer global map (#817) 2021-11-01 12:20:24 -07:00
anastasia
85f8bf97f5 Name walkeeper threads to make debugging more convenient 2021-11-01 19:09:57 +03:00
anastasia
83ed930bc2 WIP. Launch and shutdown tenant threads together with walreceiver.
TODO: now the walreceiver only disconnects if the safekeeper was shut down. Implement proper walreceiver disconnection.
2021-11-01 18:04:00 +03:00
anastasia
071e30cc53 Expose TENANT_THREADS_COUNT metric to observe number of currently active checkpointer and GC threads 2021-11-01 18:04:00 +03:00
Kirill Bulatov
e6ef27637b Better API to handle timeline metadata properly 2021-10-29 23:51:40 +03:00
Patrick Insinger
b532470792 Set SO_REUSEADDR for all TCP listeners 2021-10-29 12:45:26 -07:00
Heikki Linnakangas
e0d7ecf91c Refactor 'zenith' CLI subcommand handling
Also fixes 'zenith safekeeper restart -m immediate'. The stop-mode was
previously ignored.
2021-10-29 19:01:01 +03:00
Kirill Bulatov
edba2e9744 Use a proper extension for the readme file 2021-10-28 18:55:14 +03:00
Egor Suvorov
7e552b645f Add disk write/sync metrics to Safekeeper (#745) 2021-10-28 18:38:36 +03:00
anastasia
ea5900f155 Refactoring of checkpointer and GC.
Move them to a separate tenant_threads module to disentangle thread management from the LayeredRepository implementation.
2021-10-27 20:50:26 +03:00
anastasia
28ab40c8b7 fix init_repo() call in register_relish_download() 2021-10-27 20:50:26 +03:00
Alexey Kondratov
d423142623 Proxy: wait for kick on .pgpass connection (zenithdb/console#227) 2021-10-27 20:24:23 +03:00
Dmitry Rodionov
1c0e85f9a0 review cleanups 2021-10-27 13:30:34 +03:00
Dmitry Rodionov
5bc09074ea add a flag to avoid non-incremental size calculation in pageserver http api
This calculation is not that heavy but it is needed only in tests, and
in case the number of tenants/timelines is high the calculation can take
noticeable time.

Resolves https://github.com/zenithdb/zenith/issues/804
2021-10-27 13:30:34 +03:00
Heikki Linnakangas
1fac4a3c91 Fix a few messages.
Pointed out by Egor in https://github.com/zenithdb/zenith/pull/788,
but I accidentally pushed that before fixing these.
2021-10-27 10:58:21 +03:00
Heikki Linnakangas
1bc917324d Use -m immediate for 'immediate' shutdown 2021-10-27 10:49:38 +03:00
Heikki Linnakangas
af429fb401 Improve 'zenith' CLI utility for safekeepers and a config file.
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.

The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.

Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.

The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers.  Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.

The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.

Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.
2021-10-27 10:49:38 +03:00
Heikki Linnakangas
710fe02d0b Return success on 'zenith stop' if the page server is already stopped. 2021-10-27 01:10:24 +03:00
Heikki Linnakangas
de87aad990 Remove a few unused functions 2021-10-27 01:10:24 +03:00
Heikki Linnakangas
41d48719e1 In python tests, skip ports that are already in use.
We've seen some failures with "Address already in use" errors in the
tests. It's not clear why, perhaps some server processes are not cleaned
up properly after test, or maybe the socket is still in TIME_WAIT state.
In any case, let's make the tests more robust by checking that the port
is free, before trying to use it.
2021-10-27 00:46:24 +03:00
Kirill Bulatov
d88377f9f0 Remove log from zenith_utils 2021-10-26 23:24:11 +03:00
Kirill Bulatov
ecd577c934 Simplify tracing declarations 2021-10-26 23:24:11 +03:00
anastasia
f43f8401ee Don't wait for wal-redo process for non-relational records replay 2021-10-26 19:30:28 +03:00
Arseny Sher
1877bbc7cb bump vendor/postgres to fix reconnection busy loop 2021-10-26 15:43:19 +03:00
Heikki Linnakangas
a064ebb64c Cope with missing 'tenantid' in '.zenith/config' file.
We generate the initial tenantid and store it in the file, so it shouldn't
be missing. But let's cope with it. (This comes in handy with the bigger
changes I'm working on at https://github.com/zenithdb/zenith/pull/788)
2021-10-25 21:24:11 +03:00
Heikki Linnakangas
4726870e8d Remove obsolete comment.
We store the pageserver port in the .zenith/config file.
2021-10-25 21:16:58 +03:00
Heikki Linnakangas
3bbc106c70 Prefer long CLI option name for clarity. 2021-10-25 21:16:58 +03:00
Heikki Linnakangas
66eb081876 Improve comment on 'base_dir' 2021-10-25 21:16:58 +03:00
Kirill Bulatov
f291ab2b87 Do not panic on missing tenant 2021-10-25 18:36:30 +03:00
Heikki Linnakangas
66ec135676 Refactor pytest fixtures
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication and no safekeepers, or, if it
wants to use safekeepers or authentication, sets up a custom test-specific
ZenithEnv fixture.

Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
2021-10-25 14:14:47 +03:00
Heikki Linnakangas
28af3e5008 Remove some unnecessary fixture arguments 2021-10-25 14:14:45 +03:00
Heikki Linnakangas
f337d73a6c Rearrange output dirs a bit
Each test now gets its own test output directory, like
'test_output/test_foobar', even when TEST_SHARED_FIXTURES is used.
When TEST_SHARED_FIXTURES is not used, the zenith repo for each test
is created under a 'repo' subdir inside the test output dir, e.g.
'test_output/test_foobar/repo'
2021-10-25 14:14:43 +03:00
Heikki Linnakangas
57ce541521 Remove unnecessary 'pg_bin' object from 'postgres' fixture.
It was only used in check_restored_datadir_content(), and that function
can construct it easily from the other information it has.
2021-10-25 14:14:41 +03:00
Heikki Linnakangas
e14f24034f Turn a few path-fixtures to global variables
This way, they're readily accessible from the classes and functions
that are not themselves fixtures
2021-10-25 14:14:38 +03:00
Kirill Bulatov
04fb0a0342 Add core relish backup and restore functionality 2021-10-22 22:22:38 +03:00
Heikki Linnakangas
8c42dcc041 Fix safekeeper -D option.
The -D option to specify working directory was broken:

    $ mkdir foobar
    $ ./target/debug/safekeeper -D foobar
    Error: failed to open "foobar/safekeeper.log"

    Caused by:
        No such file or directory (os error 2)

This was because we both chdir'd into the specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekeeper.log".
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".

We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
2021-10-22 08:39:58 +03:00
Alexey Kondratov
9070a4dc02 Turn off back pressure by default 2021-10-22 01:40:43 +03:00
Egor Suvorov
86a28458c6 test_runner: use Python 3.7 in CI and improve its support (#775)
* We actually need Python 3.7 because of dataclasses
* Rerun 'pipenv lock' under Python 3.7 and add 'pipenv' to dev deps
* Update docs on developing for Python 3.7
* CircleCI: use Python 3.7 via Docker image instead of Orb
2021-10-21 20:01:29 +03:00
Egor Suvorov
c058d04250 Rename WalAcceptor to Safekeeper in most places (#741) 2021-10-21 18:26:43 +03:00
Konstantin Knizhnik
c310932121 Implement backpressure for compute node to avoid WAL overflow
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2021-10-21 18:15:50 +03:00
Egor Suvorov
ff563ff080 test_runner: fix mypy errors and force it on CI (#774)
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
2021-10-21 13:51:54 +03:00
anastasia
7f9d2a7d05 Change 'zenith tenant list' API to return tenant state added in 0dc7a3fc 2021-10-21 11:04:22 +03:00
Arthur Petukhovsky
13f4e173c9 Wait for safekeepers to catch up in test_restarts_under_load (#776) 2021-10-20 14:42:53 +03:00
Dmitry Ivanov
85116a8375 [proxy] Prevent TLS stream from hanging
This change causes writer halves of a TLS stream to always flush after a
portion of bytes has been written by `std::io::copy`. Furthermore, some
cosmetic and minor functional changes are made to facilitate debugging.
2021-10-20 14:15:49 +03:00
Egor Suvorov
e42c884c2b test_runner/README: add note on capturing logs (#778)
Became relevant after #674
2021-10-20 01:55:49 +03:00
Egor Suvorov
eb706bc9f4 Force yapf (Python code formatter) in CI (#772)
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
2021-10-19 20:13:47 +03:00
Dmitry Rodionov
798df756de suppress the FileNotFound exception instead of using missing_ok=True because the latter was added in python 3.8 and we claim to support >3.6 2021-10-19 17:13:42 +03:00
Dmitry Rodionov
732d13fe06 use the cached-property package because python<3.8 doesn't have cached_property in functools 2021-10-19 17:13:42 +03:00
Heikki Linnakangas
feae7f39c1 Support read-only nodes
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').

The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.

If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.

We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
2021-10-19 09:48:12 +03:00
Heikki Linnakangas
c2b468c958 Separate node name from the branch name in ComputeControlPlane
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same branch, so we need an identifier
for each node, separate from the branch name.
2021-10-19 09:48:10 +03:00
Heikki Linnakangas
e272a380b4 On new repo, start writing WAL only after the initial checkpoint record.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighten up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
2021-10-19 09:48:04 +03:00
anastasia
0dc7a3fc15 Change tenant_mgr to use TenantState.
It allows us to avoid locking the entire TENANTS list while one tenant is bootstrapping
and prepares the code for remote storage integration.
2021-10-18 15:40:06 +03:00
Egor Suvorov
a1bc0ada59 Dockerfile: remove wal_acceptor alias for safekeeper (#743) 2021-10-18 14:56:30 +03:00
Kirill Bulatov
e9b5224a8a Fix toml serde gotchas 2021-10-18 14:14:27 +03:00
Heikki Linnakangas
bdd039a9ee S3 DELETE call returns 204, not 200.
According to the S3 API docs, the DELETE call returns code "204 No content"
on success.
2021-10-17 16:21:58 +03:00
Heikki Linnakangas
b405eef324 Avoid writing the metadata file when it hasn't changed. 2021-10-17 14:54:39 +03:00
Kirill Bulatov
ba557d126b React on sigint 2021-10-15 21:24:24 +03:00
Patrick Insinger
2dde20a227 Bump MSRV to 1.55 2021-10-15 09:10:08 -07:00
Kirill Bulatov
4ade0bb41c Refactor upload/download_relish function signatures.
This makes them more generic, by taking any Read / Write trait
implementation, instead of operating directly on a file.
2021-10-15 11:34:15 +03:00
Stas Kelvich
100da024b6 expose pageserver http socket in docker 2021-10-15 00:26:38 +03:00
Arseny Sher
de744a44dd Add /timeline http request to safekeeper returning its status.
Which is mainly generational state (terms) and useful LSNs.

Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes #726.

ref #115
2021-10-14 19:02:38 +03:00
Heikki Linnakangas
0e026371ec Optimize WAL decoding slightly.
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
2021-10-14 14:21:23 +03:00
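
A minimal sketch of the zero-copy fast path described above, using the bytes crate (the function, its arguments, and the boundary check are illustrative assumptions, not the actual decoder):

    use bytes::{Buf, Bytes, BytesMut};

    const PAGE_SIZE: usize = 8192;

    /// If the record does not cross a WAL page boundary, split it directly off
    /// the input buffer (no copy); otherwise fall back to copying into a new
    /// buffer (the real decoder also skips page headers on the slow path).
    fn take_record(buf: &mut Bytes, offset_in_page: usize, len: usize) -> Bytes {
        if offset_in_page + len <= PAGE_SIZE {
            buf.split_to(len)
        } else {
            let mut out = BytesMut::with_capacity(len);
            out.extend_from_slice(&buf[..len]);
            buf.advance(len);
            out.freeze()
        }
    }
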
Arthur Petukhovsky
4b87acb1f6 Use logging in python tests (#674)
* Use logging in python tests

* Use f-strings for logs

* Don't log test output while running

* Use only pytest logging handler

* Add more info about pytest logging
2021-10-14 13:10:09 +03:00
Dmitry Ivanov
43957f4401 [cross-repo-ci] Use solely commit hash to test PRs in CI
See #744 for the discussion.
2021-10-13 17:16:02 +03:00
Heikki Linnakangas
8a4f092e82 Skip syncing the temp initdb installation.
Doesn't make much difference on my laptop with SSD, but every little
helps, and with a slower disk it might be noticeable.
2021-10-13 16:59:00 +03:00
Egor Suvorov
6b6b3f68be Safekeeper metrics refactor (#747) 2021-10-13 16:28:24 +03:00
Arseny Sher
96f1175a80 Cleanup hardcoded oids. 2021-10-13 10:52:47 +03:00
Patrick Insinger
1c29de81de pageserver - remove lsn from WALRecord 2021-10-13 00:03:42 -07:00
Egor Suvorov
f658263543 Revert "Dockerfile: remove wal_acceptor alias for safekeeper"
This reverts commit 64ca947722.
2021-10-12 19:05:58 +00:00
Egor Suvorov
64ca947722 Dockerfile: remove wal_acceptor alias for safekeeper 2021-10-12 19:05:16 +00:00
Egor Suvorov
23f4c0a742 Rename wal_acceptor binary to safekeeper (#740), stage 1/2
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
2021-10-12 22:03:06 +03:00
Dmitry Ivanov
7c5b99683c Speed up builds by passing make jobserver to cargo
This change brings the following improvements to our build system:

* Now BUILD_TYPE also affects rust apps.
* From now on, cargo will respect `-jN` passed via `make`. However, note
  that `rustc` may spawn multiple threads depending on compile flags.
* Cargo is able to cooperate with make to better schedule parallel jobs,
  which leads to better build times (-20s in release mode on my machine).
2021-10-12 21:02:39 +03:00
Patrick Insinger
160c4aff61 pageserver - use write guard for checkpointing 2021-10-12 10:02:15 -07:00
Patrick Insinger
6e5ca5dc5c pageserver - create TimelineWriter 2021-10-12 10:02:15 -07:00
Egor Suvorov
f3445949d1 Wal acceptor: report socket bind errors better when daemonizing (#738)
Fixes #664
2021-10-12 16:51:28 +03:00
Heikki Linnakangas
95a85312f5 Simplify code to build walredo messages.
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
2021-10-12 10:16:26 +03:00
Heikki Linnakangas
934fb8592f Detect when a checkpoint is modified in a smarter way.
Previously, the WAL receiver would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record has been processed. If it has changed, the checkpoint
relish is updated in the repository. That's somewhat expensive, the
Checkpoint::encode() function is visible in the 'perf' profile. Change that
to set a flag whenever the Checkpoint struct is modified, so that
we don't need to compare the whole struct anymore.
2021-10-12 09:09:10 +03:00
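
A sketch of the dirty-flag approach described above (the wrapper type and method names are assumptions):

    /// Wrap the decoded Checkpoint and track modifications with a flag, so the
    /// WAL receiver no longer re-encodes and compares the whole struct after
    /// every record.
    struct Tracked<T> {
        value: T,
        modified: bool,
    }

    impl<T> Tracked<T> {
        /// All mutations go through this accessor, which marks the value dirty.
        fn get_mut(&mut self) -> &mut T {
            self.modified = true;
            &mut self.value
        }

        /// Report and clear the dirty flag after processing a record.
        fn take_modified(&mut self) -> bool {
            std::mem::replace(&mut self.modified, false)
        }
    }
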
Dmitry Ivanov
bb239b4f69 [Makefile] Set default build type to debug 2021-10-11 17:08:31 +03:00
Dmitry Ivanov
1cd7900790 [Makefile] Make build type detection more precise
Previously, typos like `BUILD_TYPE=rlease` would silently
lead to building debug binaries. The current approach is also
more future-proof, since we might add `profile`, `valgrind`
as well as other build types.
2021-10-11 17:03:51 +03:00
Arseny Sher
8c61c3e54e Minor safekeeper readme fix. 2021-10-11 16:31:44 +03:00
anastasia
d7c9dd06f4 Implement graceful shutdown at 'pageserver stop':
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.

Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
2021-10-11 13:35:01 +03:00
Heikki Linnakangas
b9119f11bf Add perf test case for buffering GiST build.
When a WAL record affects multiple pages, we currently duplicate the
record for each affected page. That's a bit wasteful, but not too bad
for b-tree splits and non-hot heap updates that affect two pages. But
buffering GiST index build WAL-logs the whole relation in 32 page chunks,
with one giant WAL record for each 32-page chunk. Currently we duplicate
that giant record for each of the 32 pages, which is really wasteful.

Github issue https://github.com/zenithdb/zenith/issues/720 tracks the
problem. This commit adds a test case for it to demonstrate it.
2021-10-11 11:10:58 +03:00
Heikki Linnakangas
7216f22609 Use tracing crate to have more context in log messages.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.

This removes the explicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing.  For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
2021-10-11 08:59:06 +03:00
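
A small sketch of the pattern described above with the tracing crate (the span name, fields, and handler are illustrative):

    use tracing::{info, info_span};

    fn handle_get_page(tenant_id: &str, timeline_id: &str) {
        // Everything logged while the guard is alive carries this context,
        // so the message itself doesn't need to repeat the IDs.
        let span = info_span!("get_page", %tenant_id, %timeline_id);
        let _guard = span.enter();

        info!("processing GetPage request");
    }
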
Kirill Bulatov
bf58f7f649 Expose certain layered repository structs to reuse in relish storage (#688) 2021-10-09 19:23:57 +03:00
Patrick Insinger
3f0ebc6a40 pageserver - move early File::open call 2021-10-09 08:45:52 -07:00
Patrick Insinger
0baf4bc796 fix cargo doc complaints 2021-10-09 08:45:46 -07:00
Patrick Insinger
c356030660 pageserver - use VecMap for delta metadata & sizes 2021-10-08 15:05:22 -07:00
Patrick Insinger
c4bb6d78d4 pageserver - use VecMap for in memory segsizes 2021-10-08 14:37:32 -07:00
Patrick Insinger
3b82e806f2 pageserver - use VecMap for in-memory PageVersions 2021-10-08 14:11:07 -07:00
Egor Suvorov
403d9779d9 safekeeper: add initial metrics and HTTP handler (#699, #541)
* `wal_acceptor`: add HTTP handler, /metrics endpoint only, no authentication
* Two gauges are currently reported: `flush_lsn` and `commit_lsn`
* Add `DEFAULT_PG_LISTEN_PORT` and `DEFAULT_PG_LISTEN_PORT` consts for uniformity
2021-10-08 18:55:41 +03:00
Patrick Insinger
b3b8f18f61 tests - fix get_timeline_size signature 2021-10-07 15:38:22 -07:00
Heikki Linnakangas
960c7d69a8 Remove 'predecessor' reference from in-memory and delta layers.
The caller is now responsible for looking up the predecessor layer,
instead. This makes the code simpler, as you don't need to update the
predecessor reference when a layer is frozen or written to disk.

There was a bug in that, as Konstantin noted on discord:

    Assume that freeze doesn't create new inmem layer
    (maybe_new_open=None). Then we temporary place in historics frozen
    layer. Assume that now new put_wal_record request arrives. There is
    no open in-mem layer, so it has to create new one. It is looking for
    previous layer for read and set it as new in-mem layer
    predecessor. But as far as I understand, prev layer should be our
    temporary frozen layer. Which will be then removed from
    historics.

That leaves the predecessor field of the new in-memory layer pointing
at the frozen in-memory layer that has been removed from the layer map,
preventing it from being removed from memory.

This makes two subtle changes:

1. When the first new layer is created on a branch for a segment that
   existed on the ancestor branch, the start_lsn of the new layer is now
   the branch point + 1. We were previously slightly confused on what
   the branch point LSN meant. It means that all the WAL up to and
   *including* the LSN on the old branch is visible to the new branch.
   If we mark the start LSN of the new layer as equal to the branch point,
   that's wrong, because if there is a WAL record with that LSN on the
   predecessor layer, the new layer would hide it. This bug was hidden
   when the layer on the new branch contained a direct reference to the
   layer in the old branch, as get_page_reconstruct_data() followed that
   reference directly when it didn't find the page version in the new
   layer. But now that the caller performs the lookup, it will look up
   the new layer that doesn't contain the record, and you get an error.

2. InMemoryLayer now always stores the segment size at the beginning
   of the layer's LSN range. Previously, get_seg_size() might have
   recursed into the predecessor layer to get the size, but now we
   avoid that by always copying over the last size from the previous
   layer, when a new layer is created.
2021-10-08 00:54:13 +03:00
Heikki Linnakangas
60dae0b4ac Add test case that demonstrates Write Amplification. 2021-10-08 00:34:29 +03:00
Heikki Linnakangas
c660926a06 Refactor duplicated code to get on-disk timeline size in tests.
Move it to a common function. In passing, remove the obsolete check
to exclude the 'wal' directory. The 'wal' directory is no more.
2021-10-08 00:34:26 +03:00
Egor Suvorov
7fa04e2d14 zenith_metrics: exit process on config errors (#706) 2021-10-08 00:14:56 +03:00
Heikki Linnakangas
db4059cd6d Measure peak memory usage in perf test.
Another useful metric to keep an eye on.
2021-10-07 18:03:20 +03:00
Heikki Linnakangas
fdb19fdb92 Remove unused function.
The caller was removed in commit acc0f41985.
2021-10-07 11:24:27 +03:00
Heikki Linnakangas
53b4dc944d Don't create unused "wal" directory
It hasn't been used since commit ca9af37478.
2021-10-07 10:36:26 +03:00
MMeent
a03e1b3895 Docker build now also uses BUILD_TYPE=release. (#712)
The dockerignore and dockerfile have also been excluded from being moved into
docker images, saving docker layer cache busts if only those are changed.
2021-10-06 23:42:00 +02:00
Heikki Linnakangas
15f1bcc9c2 Remove obsolete code, now that we don't load WAL from local disk anymore.
Commit ca9af37478 removed the import_timeline_wal() call from here.
After that, the info!() message is bogus, as we no longer load the WAL
from local disk. Also, the logical size assertion is pointless now.
2021-10-06 15:59:28 +03:00
MMeent
24580f2493 Improve build system: (#703)
- Build postgresql with -O2 for releases
 - Make make make postgresql with 8 parallel threads
   The node is xlarge, so it has 8 vCPU available
2021-10-06 14:37:27 +02:00
Heikki Linnakangas
e3945d94fd Store unlogged tables locally, and replace PD_WAL_LOGGED.
All the changes are in the vendor/postgres side. However, because we now
generate fewer Full Page Writes, the 'branch_behind' test needs to be
modified so that it still generates enough WAL to consume a few WAL
segments.
2021-10-06 10:58:15 +03:00
Heikki Linnakangas
d806c3a47e pageserver - serialize PageVersion as it is
Removes the need for PageVersionMeta struct.
2021-10-05 11:07:50 -07:00
Egor Suvorov
05fe39088b Readme updates based on a fresher Ubuntu installation experience (#627) 2021-10-05 19:19:25 +03:00
Egor Suvorov
530d3eaf09 Add more details to pageserver and safekeeper docs (#680) 2021-10-05 19:10:50 +03:00
Egor Suvorov
7e190d72a5 Make pageserver_ prefix for common metric names configurable (#681) 2021-10-05 19:06:44 +03:00
Patrick Insinger
9c936034b6 pageserver - fix newer clippy lints 2021-10-05 00:28:14 -07:00
Kirill Bulatov
5719f13cb2 Rework the relish thread model (#689) 2021-10-05 10:15:56 +03:00
Patrick Insinger
d134a9856e pageserver - introduce RepoHarness for testing 2021-10-04 08:36:35 -07:00
Patrick Insinger
664b99b5ac pageserver - use constant TIMELINE_ID for tests 2021-10-04 08:36:35 -07:00
Arseny Sher
4256231eb7 Enable test_start_compute with safekeepers.
It should work now.
2021-10-04 16:50:46 +03:00
Andrey Taranik
ae27490281 wal_acceptors added to tenant creation tests 2021-10-04 08:58:49 +03:00
Andrey Taranik
fbd8ca2ff4 minor code beautification 2021-10-04 08:58:49 +03:00
Andrey Taranik
ec673a5d67 bulk tenant create test added 2021-10-04 08:58:49 +03:00
Max Sharnoff
7fab38c51e Use threadlocal for walreceiver check (#692) 2021-10-01 15:47:45 -07:00
Max Sharnoff
84f7dcd052 Fix clippy errors on nightly (2021-09-29) (#691)
Most of the changes are for the new if-then-panic lint added in
https://github.com/rust-lang/rust-clippy/pull/7669.
2021-10-01 15:45:42 -07:00
Patrick Insinger
7095a5d551 pageserver - reject and backup future layer files
If a layer file is found with LSN after the disk_consistent_lsn, it is
renamed (to avoid conflicts with new layer files) and a warning is logged.
2021-10-01 11:41:39 -07:00
Patrick Insinger
538c2a2a3e pageserver - store timeline metadata durably
The metadata file is now always 512 bytes. The last 4 bytes are a
crc32c checksum of the previous 508 bytes. Padding zeroes are added
between the serde serialization and the start of the checksum.

A single write call is used, and the file is fsyncd after.
On file creation, the parent directory is fsyncd as well.
2021-10-01 11:41:39 -07:00
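
A sketch of the fixed-size framing described above; the sizes mirror the commit message, while the crc32c crate and the function itself are assumptions:

    use crc32c::crc32c;

    const METADATA_FILE_SIZE: usize = 512;
    const CHECKSUM_SIZE: usize = 4;

    /// Pad the serialized metadata with zeroes to 508 bytes and append a
    /// crc32c of those 508 bytes in the last 4 bytes of the 512-byte file.
    fn frame_metadata(serialized: &[u8]) -> Option<[u8; METADATA_FILE_SIZE]> {
        let body_size = METADATA_FILE_SIZE - CHECKSUM_SIZE;
        if serialized.len() > body_size {
            return None; // too large to fit the fixed-size file
        }
        let mut buf = [0u8; METADATA_FILE_SIZE];
        buf[..serialized.len()].copy_from_slice(serialized);
        let checksum = crc32c(&buf[..body_size]);
        buf[body_size..].copy_from_slice(&checksum.to_le_bytes());
        Some(buf)
    }
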
Patrick Insinger
62f83869f1 pageserver - fsync image/delta layers
Ensure image and delta layer files are durable.
Also, fsync the parent directory to ensure the directory entries are
durable.
2021-10-01 11:41:39 -07:00
Patrick Insinger
69670b61c4 pageserver - use crashsafe_dir utility
Replace usage of std::fs::create_dir/create_dir_all with crashsafe
equivalents.
2021-10-01 11:41:39 -07:00
Patrick Insinger
0a8aaa2c24 zenith_utils - add crashsafe_dir
Utility for creating directories and directory trees in a crash-safe
manner.

Minimizes calls to fsync for trees.
2021-10-01 11:41:39 -07:00
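
A minimal sketch of the fsync pattern such a utility relies on (this is an illustration, not the zenith_utils implementation):

    use std::fs::{self, File};
    use std::io;
    use std::path::Path;

    /// Create a directory and make its existence durable by fsyncing the
    /// parent, so the new directory entry survives a crash.
    fn create_dir_crashsafe(path: &Path) -> io::Result<()> {
        fs::create_dir(path)?;
        if let Some(parent) = path.parent() {
            File::open(parent)?.sync_all()?;
        }
        Ok(())
    }
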
Heikki Linnakangas
e474790400 Print more details on errors to log
Fixes https://github.com/zenithdb/zenith/issues/661
2021-10-01 17:57:41 +03:00
Alexey Kondratov
2c99e2461a Allow usage of the compute hostname in the proxy 2021-10-01 16:24:35 +03:00
Stas Kelvich
cf8e27a554 Proxy: pass database name in console too 2021-10-01 14:27:52 +03:00
Kirill Bulatov
287ea2e5e3 Limit concurrent relish storage sync operations 2021-10-01 08:37:09 +03:00
Heikki Linnakangas
86e14f2f1a Bump vendor/postgres 2021-09-30 20:36:57 +03:00
Arseny Sher
adbae62281 Rename SharedState.commit_lsn to notified_commit_lsn.
ref #682
2021-09-30 17:29:15 +03:00
Egor Suvorov
3127a4a13b Safekeeper::Storage::write_wal: clarify behavior (#679)
It previously took &SafeKeeperState similar to persist(), but only for its
`server` member.
Now it takes &ServerInfo only, so it's clear the state is not persisted.
Also added a comment about sync.
2021-09-29 19:58:30 +03:00
Egor Suvorov
6d993410c9 docs/README: fix link to walkeeper's README (#677) 2021-09-29 14:40:16 +03:00
Kirill Bulatov
fb05e4cb0b Show better error messages on pageserver failures 2021-09-29 01:55:41 +03:00
Egor Suvorov
b0a7234759 pageserver: fix stale default listen addrs
* In command line help
* In dummy_conf
2021-09-28 20:57:51 +03:00
Egor Suvorov
ddf4b15ebc pageserver: use const_format crate to generate default listen addrs 2021-09-28 20:57:51 +03:00
Egor Suvorov
3065532f15 pageserver: fix mistype in listen-http arg help 2021-09-28 20:57:51 +03:00
Arthur Petukhovsky
d6fc74a412 Various fixes for test_sync_safekeepers (#668)
* Send ProposerGreeting manually in tests

* Move test_sync_safekeepers to test_wal_acceptor.py

* Capture test_sync_safekeepers output

* Add comment for handle_json_ctrl

* Save captured output in CI
2021-09-28 19:25:05 +03:00
Arseny Sher
7a370394a7 Wait till previous victim recovers in run_restarts_under_load.
Fixes test flakiness, as recovery easily might take the whole iteration.
2021-09-28 19:15:41 +03:00
Stas Kelvich
0f3cf8ac94 Cleanup Dockerfile.
* make .dockerignore `ncdu -X` compatible to easily inspect build context
* remove cargo-chef as it was introducing more problems than it was solving
* remove rocksdb packages
* add ca-certs in the resulting image. We need that to be able to make https
  connections from container with proxy to the console.
2021-09-28 18:26:20 +03:00
Heikki Linnakangas
014be8b230 Use Iterator, to avoid making one copy of page_versions BTreeMap
Reduces the CPU time spent in checkpointing, in the write_to_disk()
function.
2021-09-27 19:28:02 +03:00
Heikki Linnakangas
08978458be Refactor write_to_disk, handling dropped segment as a special case.
Similar to what commit 7fb7f67b did to 'freeze', dealing with the
dropped segment separately from the rest of the logic makes the code
easier to follow. It is also needed by the next commit that replaces
the code to build new BTreeMap with an iterator; we cannot pass one
of two kinds of closures as argument, it has to always be the same one.
Having separate DeltaLayer::create() calls for the case of dropped
segment and the other cases works around that.
2021-09-27 19:23:32 +03:00
Heikki Linnakangas
2252d9faa8 Switch to RwLock in InMemoryLayer
Allows more parallelism basically for free.
2021-09-27 19:15:40 +03:00
Arthur Petukhovsky
22e15844ae Fix clippy errors (#673) 2021-09-27 18:59:30 +03:00
Konstantin Knizhnik
ca9af37478 Do not write WAL at pageserver (#645)
* Do not write WAL at pageserver

* Remove import_timeline_wal function
2021-09-27 14:15:55 +03:00
Stas Kelvich
aae41e8661 Proxy pass for existing users.
Ask console to check per-cluster auth info.
2021-09-27 11:56:43 +03:00
Stas Kelvich
8331ce865c Intercept and log errors in mgmt interface.
That PostgresBackend had better be replaced with an http server or redis
subscription. For now let's improve logging and move on.
2021-09-27 11:56:43 +03:00
Stas Kelvich
3bac4d485d Fix EncryptionResponse message in pq_proto.rs
Positive EncryptionResponse should set 'S' byte, not 'Y'. With that
fix it is possible to connect to the proxy with SSL enabled and read
the deciphered notice text. But after the first query everything gets stuck.
2021-09-27 11:56:43 +03:00
Stas Kelvich
f84eaf4f05 Leave only pkcs8 keys support for proxy.
The rsa_private_keys() function returns an empty vector when it tries to read
a pkcs8-encoded file instead of returning an error. So the previous check was
failing on pkcs8. Leave only pkcs8 for now.
2021-09-27 11:56:43 +03:00
Arseny Sher
70b08923ed Disable new safekeepers tests as not stable enough. 2021-09-26 22:33:58 +03:00
Heikki Linnakangas
c846a824de Bump vendor/postgres, to use buffered I/O in WAL redo process.
Greatly reduces the CPU overhead in the WAL redo process.
2021-09-24 21:48:30 +03:00
Heikki Linnakangas
b71e3a40e2 Add more details to the log, when an error happens in GetPage request. 2021-09-24 21:44:22 +03:00
Heikki Linnakangas
41dfc117e7 Buffer the writes to the WAL redo process pipe.
Reduces the CPU time spent in the write() syscalls. I noticed that we were
spending a lot of CPU time in libc::write, coming from request_redo(), in
the 'bulk_insert' test. According to some quick profiling with 'perf',
this reduces the CPU time spent in request_redo() from about 30% to 15%.

For some reason, it doesn't reduce the overall runtime of the 'bulk_insert'
test much, maybe by one second if you squint (from about 37s to 36s), so
there must be some other bottleneck, like I/O. But this is surely still
a good idea, just based on the reduced CPU cycles.
2021-09-24 21:12:38 +03:00
sharnoff
a72707b8cb Redo #655 with fix: Allow LeSer/BeSer impls missing either Serialize or Deserialize
Commit message copied below:

* Allow LeSer/BeSer impls missing Serialize/Deserialize

Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.

Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.

* Remove unused #[derive(Serialize/Deserialize)]

This should hopefully reduce compile times - if only by a little bit.

Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most have *become* unused with the change to
LeSer/BeSer.
2021-09-24 10:58:01 -07:00
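
A sketch of the bounds-on-methods idea described above (trait and method names are illustrative, not the actual zenith_utils traits; shown with bincode's default functions):

    use serde::{de::DeserializeOwned, Serialize};

    /// The trait itself carries no serde bounds, so it can be implemented for
    /// any type; each method only demands the derive it actually uses.
    trait SerSketch {
        fn ser(&self) -> bincode::Result<Vec<u8>>
        where
            Self: Serialize,
        {
            bincode::serialize(self)
        }

        fn des(bytes: &[u8]) -> bincode::Result<Self>
        where
            Self: DeserializeOwned + Sized,
        {
            bincode::deserialize(bytes)
        }
    }
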
Max Sharnoff
0f770967b4 Revert "Allow LeSer/BeSer impls missing either Serialize or Deserialize (#655)
This reverts commit bd9f4794d9.
2021-09-24 10:18:36 -07:00
Max Sharnoff
bd9f4794d9 Allow LeSer/BeSer impls missing either Serialize or Deserialize (#655)
* Allow LeSer/BeSer impls missing Serialize/Deserialize

Currently, using `LeSer` or `BeSer` requires that the type implements
both `Serialize` and `DeserializeOwned`, even if we're only using the
trait for one of those functionalities.

Moving the bounds to the methods gives the convenience of the traits
without requiring unnecessary derives.

* Remove unused #[derive(Serialize/Deserialize)]

This should hopefully reduce compile times - if only by a little bit.

Some of these were already unused (we weren't using LeSer/BeSer for the
types), but most have *become* unused with the change to
LeSer/BeSer.
2021-09-24 10:06:03 -07:00
Heikki Linnakangas
ff5cbe2694 Support overlapping and nested Layers in the layer map.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.

While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.

So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.

Fixes https://github.com/zenithdb/zenith/issues/517.
2021-09-24 14:10:52 +03:00
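
As a toy illustration of the "which intervals contain the given point?" query that the new layer map answers, here is a naive linear scan over half-open LSN intervals; the actual commit uses a dedicated interval tree, so this only shows the query semantics, including overlapping and nested intervals.

    // Half-open interval [start, end), matching the layer convention of
    // inclusive start LSN and exclusive end LSN.
    #[derive(Debug)]
    struct Interval {
        start: u64,
        end: u64,
        name: &'static str,
    }

    fn containing<'a>(intervals: &'a [Interval], point: u64) -> Vec<&'a Interval> {
        intervals
            .iter()
            .filter(|iv| iv.start <= point && point < iv.end)
            .collect()
    }

    fn main() {
        // Overlapping and nested intervals are allowed, e.g. layers frozen
        // by two page servers at different points.
        let layers = vec![
            Interval { start: 100, end: 200, name: "A" },
            Interval { start: 150, end: 300, name: "B" },
            Interval { start: 160, end: 180, name: "C" },
        ];
        for iv in containing(&layers, 170) {
            println!("{} contains 170", iv.name);
        }
    }
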
Heikki Linnakangas
2319e0ec8f Define a layer's start and end bounds more precisely.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.

For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).

This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
2021-09-24 14:10:49 +03:00
Arthur Petukhovsky
d4e037f1e7 Support for --sync-safekeepers in tests (#647)
A new command has been added to append specially crafted records to the safekeeper WAL. This command takes JSON describing the record to append, encodes a LogicalMessage based on the JSON fields, and processes a new AppendRequest to append and commit WAL in the safekeeper.

The Python test starts up walkeepers and creates a config for walproposer, then appends WAL and checks that --sync-safekeepers works without errors. This test is the simplest one; more useful test cases (like in #545) for different setups will be added soon.
2021-09-24 13:19:59 +03:00
Max Sharnoff
139936197a bump vendor/postgres: Catch walkeeper ErrorResponse (#650)
Postgres commit message:

PQgetCopyData can sometimes indicate that the copy is done if the
backend returns an error response. So while we still expect that the
walkeeper never sends CopyDone, we can't expect it to never produce
errors.
2021-09-23 14:55:38 -07:00
Heikki Linnakangas
d4eed61f57 Refactor code for parsing and creating postgresql.conf.
There's surely more that could be done, but this makes it a bit more
readable at least.
2021-09-23 19:34:27 +03:00
Patrick Insinger
7db3a9e7d9 walredo - don't use RefCell on stdin/stdout 2021-09-23 08:42:58 -07:00
Patrick Insinger
c81ee3bd5b Add some comments to the checkpoint process 2021-09-23 13:19:45 +03:00
anastasia
7fb7f67bb4 Fix relish extension after it was dropped or truncated.
- Turn dropped layers into non-writeable in get_layer_for_write().

- Handle non-writeable dropped layers in the checkpointer. They don't need freezing, so just remove them from the list of open_segs and write them out to disk.

- Remove code that handles dropped layers in freeze() function. It is not used anymore.
2021-09-23 13:19:45 +03:00
anastasia
86164c8b33 Add unit tests for drop_lsn.
test_drop_extend and test_truncate_extend illustrate what happens if we dropped a segment and then created it again within the same layer.
2021-09-23 13:19:45 +03:00
Arseny Sher
97c4cd4434 bump vendor/postgres 2021-09-23 12:22:53 +03:00
anastasia
a4fc6da57b Fix gc_internal to treat dropped layers.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones
2021-09-23 12:21:47 +03:00
anastasia
c934e724a8 Enable test_list_rels_drop test 2021-09-23 12:21:47 +03:00
anastasia
e554f9514f gc refactoring
- rename 'compact' argument of GC to 'checkpoint_before_gc'.
- gc_iteration_internal() refactoring
2021-09-23 12:21:47 +03:00
Max Sharnoff
d7cff8fbaf Show more detailed query errors from postgres_backend (#651)
anyhow uses the alternate formatting style ("{:#}") to display all of
the causes of an error instead of the outermost context.

Without this, there's less information available to figure out what's
going on. It's probably too much to display in the compute node logs
though, so it's better to leave that formatting as-is.
2021-09-22 14:51:14 -07:00
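
A small self-contained illustration of the formatting difference described above, assuming the anyhow crate; the error text is made up.

    use anyhow::{Context, Result};

    fn open_file() -> Result<String> {
        // Innermost error; in real code this would come from std::fs.
        anyhow::bail!("No such file or directory (os error 2)")
    }

    fn read_config() -> Result<String> {
        open_file().context("could not open postgresql.conf")
    }

    fn handle_query() -> Result<String> {
        read_config().context("query failed")
    }

    fn main() {
        if let Err(err) = handle_query() {
            // Default formatting: only the outermost context.
            println!("plain:     {}", err);
            // Alternate formatting: the whole cause chain, joined with ": ".
            println!("alternate: {:#}", err);
        }
    }
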
Max Sharnoff
90ef661673 Fix rustc & clippy warnings for nightly (2021-09-19) (#629)
Fix clippy warnings for nightly (2021-09-19)
2021-09-22 11:24:43 -07:00
Dmitry Rodionov
579b5ee944 exclude labels formatting for every operation in LOGICAL_TIMELINE_SIZE gauge metric 2021-09-22 18:03:48 +03:00
Arthur Petukhovsky
8ebf2fe550 Add test for acceptor restarts under load (#591)
In this test safekeepers are restarted one by one, while bank transactions
are executed and validated in the background. Bank transactions consist of
balance transfers and log writes. In the end balance sum should remain the
same and there should be progress from every client, when 2 of 3 safekeeper
nodes are up.
2021-09-22 11:59:20 +03:00
Dmitry Rodionov
16d3dc821a disable parallelization for benchmarks 2021-09-21 23:08:22 +03:00
Heikki Linnakangas
a91eeb1c65 Buffer the writes when writing a layer to disk.
Significantly reduces the CPU time spent on libc::write.
2021-09-21 16:54:29 +03:00
Heikki Linnakangas
49c8c03465 Add performance test for bulk INSERT 2021-09-21 13:25:46 +03:00
Dmitry Rodionov
5344ffc3de try to reenable parallel test runs in CI 2021-09-20 21:43:09 +03:00
Heikki Linnakangas
296586b7ce bump vendor/postgres 2021-09-20 18:52:55 +03:00
Dmitry Rodionov
b7aac87ec1 fix port distribution so services do not use ephemeral ports 2021-09-20 18:44:42 +03:00
Patrick Insinger
ea4c3639e3 Include layer metadata in layer summary chapters
Include all data stored in layer filenames and the tenant+timeline IDs
inside a summary chapter. Use this chapter in the `dump_layerfile`
utility.
2021-09-20 07:57:51 -07:00
Heikki Linnakangas
745627c8ca Remove unused FE/BE ControlFile message.
It's a remnant of some old tests in Zenith, but isn't used anymore. It
doesn't exist in PostgreSQL.
2021-09-17 20:06:04 +03:00
Heikki Linnakangas
c2af6d98db Don't print 'pg_controldata' output after every startup in tests.
It's not interesting for most tests, and clutters the output. If there
are individual tests where it is worthwhile, let's add pg_controldata calls
to those tests, but I don't think it's needed for now.
2021-09-17 20:04:29 +03:00
Heikki Linnakangas
540973eac4 Don't get confused on request of latest page version with very old LSN.
If the 'latest' flag in the client request is true, the client wants the
latest page version regardless of the LSN in the request. The LSN is just
a hint in that case, indicating that the page hasn't been modified since
since that LSN. The LSN can be very old, so it's possible that the page
server has already garbage collected away the layer at that LSN. We tried
to fetch the old layer and errored out if that happened. To fix, always
fetch the data as of last-record-LSN, if 'latest' is set in the client
request. We now only use the LSN to wait if the requested LSN hasn't been
received and processed yet.

Fixes https://github.com/zenithdb/zenith/issues/567
2021-09-17 18:56:05 +03:00
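
A sketch of the decision described above, with invented, simplified types (the real page server structures differ): when 'latest' is set, the request LSN is only used for waiting, and the read itself happens at the last-record LSN.

    #[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
    struct Lsn(u64);

    struct Timeline {
        last_record_lsn: Lsn,
    }

    impl Timeline {
        fn wait_lsn(&self, lsn: Lsn) {
            // In real code this blocks until WAL up to `lsn` has been
            // received and processed; here it is a no-op placeholder.
            let _ = lsn;
        }

        fn effective_request_lsn(&self, request_lsn: Lsn, latest: bool) -> Lsn {
            if latest {
                // The client wants the newest version; the request LSN is only
                // a hint, and the layer at that LSN may already be gone.
                self.wait_lsn(request_lsn);
                self.last_record_lsn
            } else {
                // The client wants the page exactly as of this LSN.
                self.wait_lsn(request_lsn);
                request_lsn
            }
        }
    }

    fn main() {
        let tl = Timeline { last_record_lsn: Lsn(0x5000) };
        // Very old hint LSN, but latest=true: read at last_record_lsn instead.
        println!("{:?}", tl.effective_request_lsn(Lsn(0x100), true));
        println!("{:?}", tl.effective_request_lsn(Lsn(0x100), false));
    }
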
Heikki Linnakangas
ad5f16f724 Improve the protocol between Postgres and page server.
- Use different message formats for different kinds of response messages.

- Add an Error message, for passing errors from page server to Postgres.
  Previously, we would respond to 'exists' request with 'false', and
  to 'nblocks' request with 0, if an error happened. Fix those to return
  an error message to the client. GetPage requests had a mechanism to
  return an error, but it was just a flag with no error message.

- Add a flag to requests, to indicate that we actually want the latest
  page version on the timeline, and the LSN is just a hint that we know
  that there haven't been any modifications since that LSN. The flag isn't
  used for anything yet, but I'm planning to use it to fix
  https://github.com/zenithdb/zenith/issues/567
2021-09-17 16:38:14 +03:00
Kirill Bulatov
1aa7218fd6 Show underlying pageserver error details 2021-09-17 16:16:05 +03:00
Kirill Bulatov
1d5abf1253 Initial version of the relish storage 2021-09-17 15:30:22 +03:00
Dmitry Ivanov
7b3fb760fa [test_runner] psql should be oblivious to user's preferences
This makes psql ignore $HOME/.psqlrc
2021-09-17 14:16:23 +03:00
Max Sharnoff
3743344e64 Add get_timeline_for_tenant() to tenant_mgr (#615)
Most of the previous usages of get_repository_for_tenant were followed
by immediately getting a timeline in that repository, without keeping it
around for longer.

The new `get_timeline_for_tenant` function implements that same
behavior, but in one line.
2021-09-16 10:38:21 -07:00
Max Sharnoff
bbe4f39790 walkeeper: Add parsing check for hot standby tag (#597) 2021-09-16 09:04:35 -07:00
Kirill Bulatov
7dda9f2894 Fix clippy lints and enable clippy checking in CI 2021-09-16 15:09:16 +03:00
anastasia
8de41f1d70 Change checkpoint_distance type to u64 2021-09-16 12:33:50 +03:00
anastasia
6984d33b4e Run GC and checkpointer separate threads.
Add checkpoint_period configuration parameter
2021-09-16 12:33:50 +03:00
anastasia
98d4f9cea5 Add checkpoint_distance config parameter.
- Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance.
- Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.
2021-09-16 12:33:50 +03:00
Arseny Sher
87bc18972f bump vendor/postgres 2021-09-16 11:41:29 +03:00
Patrick Insinger
25b7d424ab Prevent frozen InMemoryLayer races
Instead of panicking when a race happens, retry the operation after
getting a new layer.
2021-09-15 20:50:51 -07:00
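
A minimal sketch of the retry-on-race pattern, with invented types: if the layer obtained for writing turns out to be frozen by a concurrent checkpoint, fetch a fresh layer and try again rather than panicking.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    // Stand-in for an in-memory layer that a checkpointer may freeze concurrently.
    struct Layer {
        frozen: AtomicBool,
    }

    struct LayerMap {
        open: Arc<Layer>,
    }

    impl LayerMap {
        fn get_layer_for_write(&self) -> Arc<Layer> {
            self.open.clone()
        }
    }

    fn put_page(map: &LayerMap, page: &[u8]) {
        loop {
            let layer = map.get_layer_for_write();
            if layer.frozen.load(Ordering::Acquire) {
                // Lost the race with the checkpointer: the layer was frozen
                // after we looked it up. Retry with a fresh open layer.
                continue;
            }
            // ... write `page` into `layer` here ...
            let _ = page;
            break;
        }
    }

    fn main() {
        let map = LayerMap { open: Arc::new(Layer { frozen: AtomicBool::new(false) }) };
        put_page(&map, b"page image");
        println!("write done");
    }
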
Patrick Insinger
a5bd306db9 Ensure InMemoryLayer predecessor updated correctly
When the new open InMemoryLayer predecessor is updated, ensure it was
pointing to the old frozen layer.
2021-09-15 16:04:49 -07:00
Patrick Insinger
0cbee4a416 Don't hold lock on LayerMap while writing to disk 2021-09-15 16:04:49 -07:00
Patrick Insinger
91ff09151d Remove disk IO from InMemoryLayer::freeze
Move the creation of Image and Delta layers from
`InMemoryLayer::freeze()` to `InMemoryLayer::write_to_disk`.
2021-09-15 16:04:49 -07:00
Patrick Insinger
fea5954b18 Change filling gap println! to trace! 2021-09-15 14:22:04 -07:00
Max Sharnoff
b11b0bb088 bin_ser: reject trailing bytes by default (#587)
Changes `LeSer`/`BeSer::des`. Also adds a new `des_prefix` function to
keep a way to allow trailing bytes.
2021-09-15 11:48:19 -07:00
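
A sketch of the strict-versus-prefix deserialization behaviour in terms of plain serde + bincode + anyhow (the project's own bin_ser helpers are different code, so the names here are illustrative): the strict variant fails if any bytes are left over, the prefix variant does not.

    use serde::de::DeserializeOwned;
    use std::io::Cursor;

    fn des_strict<T: DeserializeOwned>(buf: &[u8]) -> anyhow::Result<T> {
        let mut cur = Cursor::new(buf);
        let value: T = bincode::deserialize_from(&mut cur)?;
        // Reject trailing bytes: the whole buffer must have been consumed.
        anyhow::ensure!(cur.position() as usize == buf.len(), "trailing bytes after value");
        Ok(value)
    }

    fn des_prefix<T: DeserializeOwned>(buf: &[u8]) -> anyhow::Result<T> {
        // Prefix variant: trailing bytes are allowed.
        Ok(bincode::deserialize_from(Cursor::new(buf))?)
    }

    fn main() -> anyhow::Result<()> {
        let mut bytes = bincode::serialize(&42u32)?;
        bytes.push(0xff); // one byte of garbage at the end

        assert!(des_strict::<u32>(&bytes).is_err());
        assert_eq!(des_prefix::<u32>(&bytes)?, 42);
        println!("ok");
        Ok(())
    }
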
Dmitry Rodionov
0ede933719 temporarily disable parallel test runs as they seem to misbehave when there
are several concurrent CI runs
2021-09-15 18:59:59 +03:00
Kirill Bulatov
3ab60ce76f Unify tokio deps and bump cargo resolver version 2021-09-15 16:00:08 +03:00
Dmitry Rodionov
01ef2baef0 show more context for zenith cli run errors 2021-09-15 14:02:15 +03:00
Dmitry Rodionov
6a2e4bfdd9 use parallel test execution in ci 2021-09-15 14:02:15 +03:00
Dmitry Rodionov
9563336d9a Bring back check for interfering processes, add more comments and
descriptive errors
2021-09-15 14:02:15 +03:00
Dmitry Rodionov
4ebe643d0c Support parallel test running for python tests
Support is done via the pytest-xdist plugin.
To use the feature, add -n<concurrency> to the pytest invocation,
e.g. pytest -n8 to run 8 tests in parallel.

Changes in code are mostly about port assignment. Previously the port for the
pageserver was hardcoded without the ability to override it through the zenith
cli, and ports for started compute nodes were calculated twice, in the zenith
cli and in the test code. Now the zenith cli supports port arguments for the
pageserver and compute nodes to be passed explicitly.

Tests are modified in such a way that each worker gets a non-overlapping
port range which can be configured and now contains 100 ports. These
ports are distributed to test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
2021-09-15 14:02:15 +03:00
Dmitry Rodionov
dc897fb864 remove pageserver remotes support since we do not have tests for that and feature itself is delayed (#136) 2021-09-15 13:24:35 +03:00
Max Sharnoff
a2498f3e67 Improve walkeeper replication error messages & context (#585) 2021-09-14 11:59:14 -07:00
Patrick Insinger
d150f3ce8c Detect writes on frozen InMemoryLayers
Data written to frozen layers is lost. It will not appear in on-disk
structures or in successor InMemoryLayers. Here we detect this race, and
fail. I think this race is rare, but this should make it easier to track
down when it happens.
2021-09-14 11:44:48 -07:00
Patrick Insinger
cff4572774 Avoid race in get_layer_for_write
Implement the changes suggested in a comment, create
`get_layer_for_read_locked` so that `get_layer_for_write` doesn't have
to drop the LayerMap lock when searching for the predecessor.
2021-09-14 11:24:24 -07:00
Dmitry Rodionov
84008a2560 factor out common logging initialisation routine
This contains the lowest common denominator of the pageserver and safekeeper log
initialisation routines. It uses the daemonize flag to decide where to
stream log messages: when daemonize is true, log messages are
forwarded to a file; otherwise they are streamed to stdout. Stdout is the
default log output on the Docker side of things, which makes
it easier to browse our logs via built-in docker commands.
2021-09-14 18:09:14 +03:00
Dmitry Ivanov
6b7f3bc78c Add inter-repo CI job to CircleCI configuration
This job will be responsible for triggering remote CI pipeline in
zenithdb/console repository. That way, we'll always know when
a PR to zenithdb/zenith breaks the cloud console app.
2021-09-14 16:56:04 +03:00
Arseny Sher
a68c23448a Skip the bootstrap hole in safekeeper's find_end_of_wal.
Otherwise a restart of the safekeeper before the first segment is filled makes it
report 0 as the flushed LSN. To this end, tweak find_end_of_wal_segment to allow
starting from a given LSN, not only from the start of the segment. While here,
make it less panicky.
2021-09-13 22:46:04 +03:00
Dmitry Rodionov
9043f45489 removes protobuf dependency (brought by prometheus default features) 2021-09-13 15:57:41 +03:00
Heikki Linnakangas
6afd99c73f Fix misc typos in comments. 2021-09-13 12:31:04 +03:00
nkotlyarov
18b5165b22 Update README.md
typo
2021-09-12 15:35:18 +03:00
Arseny Sher
6dc66eefb6 bump vendor/postgres 2021-09-11 06:10:10 +03:00
Arseny Sher
0aec60938a Make flush_lsn reported by safekeepers point to record boundary.
Otherwise we produce corrupted-record holes in the WAL during compute node restart,
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- the reliably persisted part of the WAL.

ref #549.

Mostly by @knizhnik. I adjusted it to make sure the proposer always starts streaming
from the beginning of a record, so we don't need special quirks for decoding in the
safekeeper.
2021-09-11 06:10:10 +03:00
Patrick Insinger
7c62a57e54 initialize tenant_mgr after daemonizing
Ran into problems launching the WAL redo process on OS X after 4b73ad.
Launching the `initdb` process was met with "bad file descriptor" errors.
Using dtrace, I found shortly after calling `posix_spawn` for `initdb`,
`kevent` was returning this error.

I haven't dug super deep to see if the daemonization itself is the
problem, but this commit fixes it for me. My hunch is that some file
descriptors used when the Tokio runtime is initialized become invalid
in the daemon process.
2021-09-10 13:00:39 +03:00
Heikki Linnakangas
59e7ca585d Minor fixes 2021-09-10 12:43:11 +03:00
anastasia
3dea06b825 Update layered_repository/README.md 2021-09-10 12:43:11 +03:00
Heikki Linnakangas
ab33614ab1 Forbid adding WAL to the repository after advancing last record LSN.
When you advance last record LSN, *all* changes up to that LSN should be
imported into repository. We have been a bit sloppy about that when it
comes to the checkpoint information that we also store in the repository.
In WAL receiver, for example, we would receive a WAL record, advance
last record LSN, and only then update the checkpoint relish at the same
LSN. Reorder that so that you advance the last record LSN only after
updating the checkpoint relish. It hasn't apparently caused any problems
so far, but let's be tidy.

Tighten the check for that in get_layer_for_write(), so that it checks for
'lsn > last_record_lsn' rather than 'lsn >= last_record_lsn'.
2021-09-10 10:59:09 +03:00
Heikki Linnakangas
03dff207db Remove start_lsn arg from create_empty_repository.
Always use lsn(0) as the initial last_record_lsn. It is updated soon after
creating the timeline anyway, after loading the bootstrap data, so it
doesn't stay long in that state. I was a bit worried about using a special
value like 0, but it's actually nice that you can distinguish it from any
real LSN value. The unit tests have been using Lsn(0) as the initial start
LSN all along.
2021-09-10 10:24:35 +03:00
Heikki Linnakangas
6a8785379a Add explicit 'wait_lsn' calls before get_page_at_lsn and such calls.
Move the responsibility to wait for the WAL to arrive to the callers, and
remove the wait_lsn() calls from the Timeline::get_page_at_lsn() and
friends. We were not totally consistent before, list_rels() was missing the
wait_lsn() call for example.

Closes https://github.com/zenithdb/zenith/issues/521
2021-09-10 09:56:11 +03:00
Heikki Linnakangas
507177b42e Refactor code to handle incoming page requests. 2021-09-09 18:48:46 +03:00
anastasia
b79754d06e list_rels() and list_nonrels() refactoring:
move shared code to list_relishes() function.
2021-09-09 16:05:32 +03:00
anastasia
674807eee1 Add test for dropped relations. Fix list_rels() and list_nonrels() functions 2021-09-09 16:05:32 +03:00
Konstantin Knizhnik
30c0343727 Use layer start_lsn instead of *entry_lsn as LSN to continue WAL record traversal at next layer (#573)
refer #532
2021-09-09 15:15:50 +03:00
Dmitry Rodionov
4fae115dc2 propagate pageserver http error messages to zenith cli 2021-09-08 17:32:59 +03:00
anastasia
3d17255400 Add comment to 'pg stop' changes 2021-09-08 14:12:00 +03:00
anastasia
5488ce8834 Change CLI command 'pg stop' to avoid races in tests.
Stop postgres immediately only when destroy option is used. Otherwise, use default shutdown mode (fast).
2021-09-08 14:12:00 +03:00
Max Sharnoff
d7313bb85c Switch tokio-postgres dependency to git repo
The other crates in this repository use zenithdb/rust-postgres as a
dependency for the related items, instead of the crates.io versions.

Switching to using that for the proxy as well removes an additional
three dependencies when we compile. (319 -> 316)
2021-09-07 19:49:03 -07:00
Dmitry Rodionov
4b73ada26e fix connection error appeared on zenith start
by binding sockets before daemonization

also use less annoying error reporting by not printing full error
messages for connect errors in the first several connection retries

closes #507
2021-09-07 20:50:27 +03:00
Dmitry Rodionov
b4ecae33e4 add incremental tracking of logical timeline size
In order to avoid problems with keeping disk and memory in sync, the logical
size is not stored in the on-disk metadata. It is calculated on timeline
"start" by scanning the contents of the layered repo, and the size is then maintained
via an atomic variable.

This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
2021-09-07 18:25:15 +03:00
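
A simplified sketch of the incremental size tracking described above, with invented names: the initial value comes from a one-time scan at timeline start, and every later change adjusts an atomic counter instead of touching on-disk metadata.

    use std::sync::atomic::{AtomicI64, Ordering};

    struct TimelineSize {
        // Logical timeline size in bytes; not persisted, recomputed at startup.
        logical_size: AtomicI64,
    }

    impl TimelineSize {
        fn new(initial: i64) -> Self {
            // `initial` would come from scanning the layered repository once.
            TimelineSize { logical_size: AtomicI64::new(initial) }
        }

        fn record_growth(&self, bytes: i64) {
            self.logical_size.fetch_add(bytes, Ordering::Relaxed);
        }

        fn record_shrink(&self, bytes: i64) {
            self.logical_size.fetch_sub(bytes, Ordering::Relaxed);
        }

        fn get(&self) -> i64 {
            self.logical_size.load(Ordering::Relaxed)
        }
    }

    fn main() {
        let size = TimelineSize::new(8192 * 10); // 10 blocks found at startup
        size.record_growth(8192);                // relation extended by one block
        size.record_shrink(8192 * 2);            // relation truncated by two blocks
        println!("logical size = {} bytes", size.get());
    }
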
Patrick Insinger
1b9e49eb60 pageserver - update unload() comment
Update comment to reflect changes made in 5ac4a2 and 98f496
2021-09-07 08:19:42 -07:00
Heikki Linnakangas
7a03e32dd5 Use Rust shorthand range syntax 2021-09-07 18:10:07 +03:00
Heikki Linnakangas
018a606987 Refactor code in LayerMap, for readability
- Reorder the structs and functions
- Delegate many of the operations in LayerMap to SegEntry. For example,
  `LayerMap::insert_open` now looks up the right SegEntry struct, and
  then calls `SegEntry::insert_open` on it.
- Use HashMap::entry() function with or_default() to implement the lookups
  with less code
2021-09-07 18:10:07 +03:00
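
A tiny illustration of the entry()/or_default() pattern mentioned above, with a made-up SegEntry stand-in; it replaces the usual "look up, insert if missing, look up again" sequence with one call chain.

    use std::collections::HashMap;

    #[derive(Default, Debug)]
    struct SegEntry {
        // Simplified stand-in for the per-segment entry in the layer map.
        open_layers: Vec<&'static str>,
    }

    fn main() {
        let mut layer_map: HashMap<&'static str, SegEntry> = HashMap::new();

        // entry() + or_default() creates the SegEntry on first use, so
        // insert_open-style code needs no separate "does it exist?" branch.
        layer_map
            .entry("rel 1663/13010/16384 seg 0")
            .or_default()
            .open_layers
            .push("inmem layer @ 0/169C3D8");

        println!("{:?}", layer_map);
    }
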
Heikki Linnakangas
26782851a9 Rename OpenSegEntry to OpenLayerEntry
That's more appropriate: it's a struct that holds a Layer, not a segment.
2021-09-07 18:10:07 +03:00
Heikki Linnakangas
04ee1d5977 Add test for managing old open segments in binary heap.
I thought this test would trigger the bug fixed previous commit, but
it did not. More tests are nice in any case.
2021-09-07 18:10:07 +03:00
Heikki Linnakangas
6245702c7c Comment fixes 2021-09-07 18:10:07 +03:00
Heikki Linnakangas
9098f2159d Fix comparison routines of OpenSegEntry
Commit 66929ad6fb added a 'generation' number to open segments stored
in the layer map, to distinguish old layers from layers that were
added to the map during checkpoint processing. But it neglected the
OpenSegEntry::cmp() function.

It seems that the cmp() function is never used by BinaryHeap, so this
didn't cause any user-visible bugs (I tried adding a panic() to the
cmp() function and it didn't fire). But it's clearly wrong and we need
to fix it, anyway.
2021-09-07 18:10:07 +03:00
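
A sketch of what including the generation in the comparison looks like, with invented types: deriving Ord on a (LSN, generation) struct makes a BinaryHeap distinguish entries that share an LSN but were added in different checkpoint cycles.

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    // Entry in an "oldest open layer first" heap; the generation says which
    // checkpoint cycle added it.
    #[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
    struct OpenLayerEntry {
        oldest_pending_lsn: u64,
        generation: u64,
    }

    fn main() {
        let mut heap = BinaryHeap::new();
        heap.push(Reverse(OpenLayerEntry { oldest_pending_lsn: 0x2000, generation: 1 }));
        heap.push(Reverse(OpenLayerEntry { oldest_pending_lsn: 0x2000, generation: 2 }));
        heap.push(Reverse(OpenLayerEntry { oldest_pending_lsn: 0x1000, generation: 2 }));

        // The derived Ord compares fields in declaration order, so the
        // generation participates in the comparison instead of being ignored.
        while let Some(Reverse(entry)) = heap.pop() {
            println!("{:?}", entry);
        }
    }
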
Kirill Bulatov
292bdaa6a7 Update documentation to note some Postgres specifics 2021-09-07 17:48:41 +03:00
anastasia
6f0c065743 preserve filediff artifacts in CI 2021-09-07 16:58:21 +03:00
anastasia
94c50e3e90 Fix check_restored_datadir_content(). Call 'basebackup' command directly, instead of relying on CLI 2021-09-07 16:58:21 +03:00
Konstantin Knizhnik
f83108002b Revert "Bump postgres version"
This reverts commit 511873aaed.
2021-09-07 15:06:43 +03:00
Konstantin Knizhnik
511873aaed Bump postgres version 2021-09-07 15:05:08 +03:00
anastasia
eb3fd7a8da print diff for mismatching files in check_restored_datadir_content() 2021-09-06 18:21:23 +03:00
Konstantin Knizhnik
a3214e982d Transaction commit redo handler should set TRANSACTION_STATUS_COMMITTED status for subtransactions, not TRANSACTION_STATUS_SUB_COMMITTED
Closes #535
2021-09-06 18:21:23 +03:00
anastasia
1e172230ce Add test function to compare files in compute nodes to catch bugs in SLRU replay.
Compare files in an existing compute node's pgdata with a fresh basebackup at the same LSN. We expect that the content is identical, except for tmp files.
Use it after some tests.
2021-09-06 18:21:23 +03:00
Arseny Sher
51d36b9930 bump vendor/postgres 2021-09-06 13:06:20 +03:00
Arseny Sher
d1f0b1eda4 Adapt safekeepers to --sync-safekeepers walproposer mode.
1) Do the epoch switch without a record from the new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account the WAL we have locally --
   setting it further is incorrect.
3) Report it back to walproposer so it knows when sync is done.
4) Remove the system id check as it is unknown in sync mode.

And make logging slightly better.

ref #439
2021-09-06 13:06:20 +03:00
Stas Kelvich
ed4eed0a19 Make use of postgres --sync-safekeepers in tests and CLI.
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change a few tests to accommodate this new behavior.
2021-09-06 13:06:20 +03:00
Konstantin Knizhnik
2cf3a70be5 Add description of Zenith changes in Postgres core (#533)
* Add description of Zenith changes in Postgres core

* Update README.md
2021-09-03 19:48:26 +03:00
Kirill Bulatov
6d42ea47bf Check rusage return code 2021-09-03 17:29:23 +03:00
Konstantin Knizhnik
b227c63edf Set proper xl_prev in basebackup, when possible.
In passing, fix two minor issues with basebackup:
* check that we can't create branches with pre-initdb LSNs
* normalize branch LSNs that are pointing to the segment boundary

patch by @knizhnik
closes #506
2021-09-03 14:58:59 +03:00
anastasia
45c09c1cdd Add LayerMap.dump() function for debugging.
Print timelineid in layer dumps
2021-09-03 11:00:38 +03:00
anastasia
66dcaa4e01 Rename put_unlink() to drop_relish() in Timeline trait.
Rename put_unlink() to drop_segment() in Layer trait.
2021-09-03 11:00:38 +03:00
anastasia
a7de53d4c4 Improve comments for Layer trait. 2021-09-03 11:00:38 +03:00
anastasia
fabf5ec664 Don't use term 'snapshot' to describe layers 2021-09-03 11:00:38 +03:00
Heikki Linnakangas
c6678c5dea Include # of bytes written in pgbench benchmark result
Now that the page server collects this metric (since commit 212920e47e),
let's include it in the performance test results

The new metric looks like this:

    performance/test_perf_pgbench.py .         [100%]
    --------------- Benchmark results ----------------
    test_pgbench.init: 6.784 s
    test_pgbench.pageserver_writes: 466 MB    <---- THIS IS NEW
    test_pgbench.5000_xacts: 8.196 s
    test_pgbench.size: 163 MB

    =============== 1 passed in 21.00s ===============
2021-09-03 09:00:26 +03:00
Heikki Linnakangas
1686715ad0 Partial fix for issue with extending relation with a gap.
This should fix the sporadic regression test failures we've been seeing
lately with "no base img found" errors.

This fixes the common case, but one corner case is still not handled:
If a relation is extended across a segment boundary, leaving a gap block
in the segment preceding the segment containing the target block, the
preceding segment will not be padded with zeros correctly. This adds
a test case for that, but it's commented out.

See github issue https://github.com/zenithdb/zenith/issues/500
2021-09-02 22:01:46 +03:00
Patrick Insinger
7507f4b309 zenith_utils - box BidiStream::Tls variant
Clippy warns that one variant is 40 bytes and the other is 568 bytes.
Box the larger variant to avoid this warning
2021-09-02 09:16:03 -07:00
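
A generic illustration of the clippy::large_enum_variant fix described above; the field sizes and types are made up, the point is that boxing the big variant keeps every BidiStream value small.

    // A large, stack-heavy state object standing in for the TLS machinery.
    struct TlsState {
        _buf: [u8; 512],
    }

    // Without the Box, every BidiStream value would be as large as TlsState.
    enum BidiStream {
        Plain(std::net::TcpStream),
        Tls(Box<TlsState>),
    }

    fn main() {
        println!("TlsState:   {} bytes", std::mem::size_of::<TlsState>());
        println!("BidiStream: {} bytes", std::mem::size_of::<BidiStream>());

        // Constructing the boxed variant moves the big state to the heap.
        let stream = BidiStream::Tls(Box::new(TlsState { _buf: [0u8; 512] }));
        match stream {
            BidiStream::Tls(_) => println!("tls variant is heap-allocated"),
            BidiStream::Plain(_) => println!("plain variant"),
        }
    }
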
Dmitry Rodionov
bc709561b6 fix clippy warnings 2021-09-02 18:54:44 +03:00
Kirill Bulatov
0e4cbe0165 Fix some typos 2021-09-02 17:27:18 +03:00
Heikki Linnakangas
66929ad6fb Fix infinite loop with forced repository checkpoint.
To fix, break out of the loop when you reach an in-memory layer that was
created after the checkpoint started. To do that, add a "generation"
counter into the layer map.

Fixes https://github.com/zenithdb/zenith/issues/494
2021-09-02 15:41:40 +03:00
177 changed files with 22290 additions and 6817 deletions


@@ -1,20 +1,19 @@
version: 2.1
orbs:
python: circleci/python@1.4.0
executors:
zenith-build-executor:
resource_class: xlarge
docker:
- image: cimg/rust:1.52.1
- image: cimg/rust:1.56.1
zenith-python-executor:
docker:
- image: cimg/python:3.7.10 # Oldest available 3.7 with Ubuntu 20.04 (for GLIBC and Rust) at CircleCI
jobs:
check-codestyle:
check-codestyle-rust:
executor: zenith-build-executor
steps:
- checkout
- run:
name: rustfmt
when: always
@@ -24,6 +23,12 @@ jobs:
# A job to build postgres
build-postgres:
executor: zenith-build-executor
parameters:
build_type:
type: enum
enum: ["debug", "release"]
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
# Checkout the git repo (circleci doesn't have a flag to enable submodules here)
- checkout
@@ -39,7 +44,7 @@ jobs:
name: Restore postgres cache
keys:
# Restore ONLY if the rev key matches exactly
- v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
- v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
# FIXME We could cache our own docker container, instead of installing packages every time.
- run:
@@ -59,12 +64,12 @@ jobs:
if [ ! -e tmp_install/bin/postgres ]; then
# "depth 1" saves some time by not cloning the whole repo
git submodule update --init --depth 1
make postgres
make postgres -j8
fi
- save_cache:
name: Save postgres cache
key: v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
key: v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
paths:
- tmp_install
@@ -75,6 +80,8 @@ jobs:
build_type:
type: enum
enum: ["debug", "release"]
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
- run:
name: apt install dependencies
@@ -96,7 +103,7 @@ jobs:
name: Restore postgres cache
keys:
# Restore ONLY if the rev key matches exactly
- v03-postgres-cache-{{ checksum "/tmp/cache-key-postgres" }}
- v04-postgres-cache-<< parameters.build_type >>-{{ checksum "/tmp/cache-key-postgres" }}
- restore_cache:
name: Restore rust cache
@@ -104,73 +111,129 @@ jobs:
# Require an exact match. While an out of date cache might speed up the build,
# there's no way to clean out old packages, so the cache grows every time something
# changes.
- v03-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
- v04-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
# Build the rust code, including test binaries
- run:
name: Rust build << parameters.build_type >>
command: |
export CARGO_INCREMENTAL=0
BUILD_TYPE="<< parameters.build_type >>"
if [[ $BUILD_TYPE == "debug" ]]; then
echo "Build in debug mode"
cargo build --bins --tests
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
echo "Build in release mode"
cargo build --release --bins --tests
cov_prefix=()
CARGO_FLAGS=--release
fi
export CARGO_INCREMENTAL=0
"${cov_prefix[@]}" cargo build $CARGO_FLAGS --bins --tests
- save_cache:
name: Save rust cache
key: v03-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
key: v04-rust-cache-deps-<< parameters.build_type >>-{{ checksum "Cargo.lock" }}
paths:
- ~/.cargo/registry
- ~/.cargo/git
- target
# Run style checks
# has to run separately from cargo fmt section
# since needs to run with dependencies
- run:
name: cargo clippy
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" ./run_clippy.sh
# Run rust unit tests
- run: cargo test
- run:
name: cargo test
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" cargo test
# Install the rust binaries, for use by test jobs
# `--locked` is required; otherwise, `cargo install` will ignore Cargo.lock.
# FIXME: this is a really silly way to install; maybe we should just output
# a tarball as an artifact? Or a .deb package?
- run:
name: cargo install
name: Install rust binaries
command: |
export CARGO_INCREMENTAL=0
BUILD_TYPE="<< parameters.build_type >>"
if [[ $BUILD_TYPE == "debug" ]]; then
echo "Install debug mode"
CARGO_FLAGS="--debug"
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
echo "Install release mode"
# The default is release mode; there is no --release flag.
CARGO_FLAGS=""
cov_prefix=()
fi
binaries=$(
"${cov_prefix[@]}" cargo metadata --format-version=1 --no-deps |
jq -r '.packages[].targets[] | select(.kind | index("bin")) | .name'
)
test_exe_paths=$(
"${cov_prefix[@]}" cargo test --message-format=json --no-run |
jq -r '.executable | select(. != null)'
)
mkdir -p /tmp/zenith/bin
mkdir -p /tmp/zenith/test_bin
mkdir -p /tmp/zenith/etc
# Install target binaries
for bin in $binaries; do
SRC=target/$BUILD_TYPE/$bin
DST=/tmp/zenith/bin/$bin
cp $SRC $DST
echo $DST >> /tmp/zenith/etc/binaries.list
done
# Install test executables (for code coverage)
if [[ $BUILD_TYPE == "debug" ]]; then
for bin in $test_exe_paths; do
SRC=$bin
DST=/tmp/zenith/test_bin/$(basename $bin)
cp $SRC $DST
echo $DST >> /tmp/zenith/etc/binaries.list
done
fi
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path pageserver
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path walkeeper
cargo install $CARGO_FLAGS --locked --root /tmp/zenith --path zenith
# Install the postgres binaries, for use by test jobs
# FIXME: this is a silly way to do "install"; maybe just output a standard
# postgres package, whatever the favored form is (tarball? .deb package?)
# Note that pg_regress needs some build artifacts that probably aren't
# in the usual package...?
- run:
name: postgres install
name: Install postgres binaries
command: |
cp -a tmp_install /tmp/zenith/pg_install
# Save the rust output binaries for other jobs in this workflow.
# Save the rust binaries and coverage data for other jobs in this workflow.
- persist_to_workspace:
root: /tmp/zenith
paths:
- "*"
check-codestyle-python:
executor: zenith-python-executor
steps:
- checkout
- run:
name: Install deps
command: pipenv --python 3.7 install --dev
- run:
name: Run yapf to ensure code format
when: always
command: pipenv run yapf --recursive --diff .
- run:
name: Run mypy to check types
when: always
command: pipenv run mypy .
run-pytest:
#description: "Run pytest"
executor: python/default
executor: zenith-python-executor
parameters:
# pytest args to specify the tests to run.
#
@@ -193,6 +256,14 @@ jobs:
needs_postgres_source:
type: boolean
default: false
run_in_parallel:
type: boolean
default: true
save_perf_report:
type: boolean
default: false
environment:
BUILD_TYPE: << parameters.build_type >>
steps:
- attach_workspace:
at: /tmp/zenith
@@ -202,35 +273,74 @@ jobs:
steps:
- run: git submodule update --init --depth 1
- run:
name: Install pipenv & deps
working_directory: test_runner
command: |
pip install pipenv
pipenv install
name: Install deps
command: pipenv --python 3.7 install
- run:
name: Run pytest
working_directory: test_runner
# pytest doesn't output test logs in real time, so CI job may fail with
# `Too long with no output` error, if a test is running for a long time.
# In that case, tests should have internal timeouts that are less than
# no_output_timeout, specified here.
no_output_timeout: 10m
environment:
- ZENITH_BIN: /tmp/zenith/bin
- POSTGRES_DISTRIB_DIR: /tmp/zenith/pg_install
- TEST_OUTPUT: /tmp/test_output
# this variable will be embedded in perf test report
# and is needed to distinguish different environments
- PLATFORM: zenith-local-ci
command: |
TEST_SELECTION="<< parameters.test_selection >>"
PERF_REPORT_DIR="$(realpath test_runner/perf-report-local)"
TEST_SELECTION="test_runner/<< parameters.test_selection >>"
EXTRA_PARAMS="<< parameters.extra_params >>"
if [ -z "$TEST_SELECTION" ]; then
echo "test_selection must be set"
exit 1
fi
if << parameters.run_in_parallel >>; then
EXTRA_PARAMS="-n4 $EXTRA_PARAMS"
fi
if << parameters.save_perf_report >>; then
if [[ $CIRCLE_BRANCH == "main" ]]; then
mkdir -p "$PERF_REPORT_DIR"
EXTRA_PARAMS="--out-dir $PERF_REPORT_DIR $EXTRA_PARAMS"
fi
fi
export GITHUB_SHA=$CIRCLE_SHA1
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
# Run the tests.
#
# The junit.xml file allows CircleCI to display more fine-grained test information
# in its "Tests" tab in the results page.
# -s prevents pytest from capturing output, which helps to see
# what's going on if the test hangs
# --verbose prints name of each test (helpful when there are
# multiple tests in one file)
# -rA prints summary in the end
pipenv run pytest --junitxml=$TEST_OUTPUT/junit.xml --tb=short -s --verbose -rA $TEST_SELECTION $EXTRA_PARAMS
# -n4 uses four processes to run tests via pytest-xdist
# -s (which would prevent pytest from capturing output) is not used, because tests run
# in parallel and logs would be mixed between different tests
"${cov_prefix[@]}" pipenv run pytest \
--junitxml=$TEST_OUTPUT/junit.xml \
--tb=short \
--verbose \
-m "not remote_cluster" \
-rA $TEST_SELECTION $EXTRA_PARAMS
if << parameters.save_perf_report >>; then
if [[ $CIRCLE_BRANCH == "main" ]]; then
# TODO: reuse scripts/git-upload
export REPORT_FROM="$PERF_REPORT_DIR"
export REPORT_TO=local
scripts/generate_and_push_perf_report.sh
fi
fi
- run:
# CircleCI artifacts are preserved one file at a time, so skipping
# this step isn't a good idea. If you want to extract the
@@ -239,13 +349,72 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "wal_acceptor.log" ! -name "regression.diffs" ! -name "junit.xml" -delete
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
# The store_test_results step tells CircleCI where to find the junit.xml file.
- store_test_results:
path: /tmp/test_output
# Save coverage data (if any)
- persist_to_workspace:
root: /tmp/zenith
paths:
- "*"
coverage-report:
executor: zenith-build-executor
steps:
- attach_workspace:
at: /tmp/zenith
- checkout
- restore_cache:
name: Restore rust cache
keys:
# Require an exact match. While an out of date cache might speed up the build,
# there's no way to clean out old packages, so the cache grows every time something
# changes.
- v04-rust-cache-deps-debug-{{ checksum "Cargo.lock" }}
- run:
name: Install llvm-tools
command: |
# TODO: install a proper symbol demangler, e.g. rustfilt
# TODO: we should embed this into a docker image
rustup component add llvm-tools-preview
- run:
name: Build coverage report
command: |
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
scripts/coverage \
--dir=/tmp/zenith/coverage report \
--input-objects=/tmp/zenith/etc/binaries.list \
--commit-url=$COMMIT_URL \
--format=github
- run:
name: Upload coverage report
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
REPORT_URL=https://zenithdb.github.io/zenith-coverage-data/$CIRCLE_SHA1
COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1
scripts/git-upload \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \
--message="Add code coverage for $COMMIT_URL" \
copy /tmp/zenith/coverage/report $CIRCLE_SHA1 # COPY FROM TO_RELATIVE
# Add link to the coverage report to the commit
curl -f -X POST \
https://api.github.com/repos/$LOCAL_REPO/statuses/$CIRCLE_SHA1 \
-H "Accept: application/vnd.github.v3+json" \
--user "$CI_ACCESS_TOKEN" \
--data \
"{
\"state\": \"success\",
\"context\": \"zenith-coverage\",
\"description\": \"Coverage report is ready\",
\"target_url\": \"$REPORT_URL\"
}"
# Build zenithdb/zenith:latest image and push it to Docker hub
docker-image:
@@ -262,22 +431,71 @@ jobs:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
docker build -t zenithdb/zenith:latest . && docker push zenithdb/zenith:latest
docker build --build-arg GIT_VERSION=$CIRCLE_SHA1 -t zenithdb/zenith:latest . && docker push zenithdb/zenith:latest
# Trigger a new remote CI job
remote-ci-trigger:
docker:
- image: cimg/base:2021.04
parameters:
remote_repo:
type: string
environment:
REMOTE_REPO: << parameters.remote_repo >>
steps:
- run:
name: Set PR's status to pending
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
curl -f -X POST \
https://api.github.com/repos/$LOCAL_REPO/statuses/$CIRCLE_SHA1 \
-H "Accept: application/vnd.github.v3+json" \
--user "$CI_ACCESS_TOKEN" \
--data \
"{
\"state\": \"pending\",
\"context\": \"zenith-remote-ci\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
- run:
name: Request a remote CI test
command: |
LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME
curl -f -X POST \
https://api.github.com/repos/$REMOTE_REPO/actions/workflows/testing.yml/dispatches \
-H "Accept: application/vnd.github.v3+json" \
--user "$CI_ACCESS_TOKEN" \
--data \
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"zenith-remote-ci\",
\"commit_hash\": \"$CIRCLE_SHA1\",
\"remote_repo\": \"$LOCAL_REPO\"
}
}"
workflows:
build_and_test:
jobs:
- check-codestyle
- build-postgres
- check-codestyle-rust
- check-codestyle-python
- build-postgres:
name: build-postgres-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
- build-zenith:
name: build-zenith-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
requires:
- build-postgres
- build-postgres-<< matrix.build_type >>
- run-pytest:
name: pg_regress tests << matrix.build_type >>
name: pg_regress-tests-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
@@ -286,7 +504,7 @@ workflows:
requires:
- build-zenith-<< matrix.build_type >>
- run-pytest:
name: other tests << matrix.build_type >>
name: other-tests-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
@@ -297,8 +515,16 @@ workflows:
name: benchmarks
build_type: release
test_selection: performance
run_in_parallel: false
save_perf_report: true
requires:
- build-zenith-release
- coverage-report:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
requires:
# TODO: consider adding more
- other-tests-debug
- docker-image:
# Context gives an ability to login
context: Docker Hub
@@ -308,5 +534,16 @@ workflows:
only:
- main
requires:
- pg_regress tests release
- other tests release
- pg_regress-tests-release
- other-tests-release
- remote-ci-trigger:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
remote_repo: "zenithdb/console"
requires:
# XXX: Successful build doesn't mean everything is OK, but
# the job to be triggered takes so much time to complete (~22 min)
# that it's better not to wait for the commented-out steps
- build-zenith-debug
# - pg_regress-tests-release
# - other-tests-release


@@ -2,12 +2,17 @@
**/__pycache__
**/.pytest_cache
/target
/tmp_check
/tmp_install
/tmp_check_cli
/test_output
/.vscode
/.zenith
/integration_tests/.zenith
/Dockerfile
.git
target
tmp_check
tmp_install
tmp_check_cli
test_output
.vscode
.zenith
integration_tests/.zenith
.mypy_cache
Dockerfile
.dockerignore

114
.github/workflows/benchmarking.yml vendored Normal file

@@ -0,0 +1,114 @@
name: benchmarking
on:
# uncomment to run on push for debugging your PR
# push:
# branches: [ mybranch ]
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '36 7 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
env:
BASE_URL: "https://console.zenith.tech"
jobs:
bench:
# this workflow runs on a self-hosted runner
# its environment is quite different from the usual github runner
# probably the most important difference is that it doesn't start from a clean workspace each time
# e.g. if you install system packages they are not cleaned up since you install them directly on the host machine
# not in a container or something
# See documentation for more info: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners
runs-on: [self-hosted, zenith-benchmarker]
env:
PG_BIN: "/usr/pgsql-13/bin"
steps:
- name: Checkout zenith repo
uses: actions/checkout@v2
# actions/setup-python@v2 is not working correctly on self-hosted runners
# see https://github.com/actions/setup-python/issues/162
# and probably https://github.com/actions/setup-python/issues/162#issuecomment-865387976 in particular
# so the simplest solution to me is to use the already installed system python and spin up virtualenvs for job runs.
# there is Python 3.7.10 already installed on the machine, so use it to install pipenv and then use pipenv's virtualenvs
- name: Install pipenv & deps
run: |
python3 -m pip install --upgrade pipenv wheel
# since pip/pipenv caches are reused there shouldn't be any troubles with install every time
pipenv install
- name: Show versions
run: |
echo Python
python3 --version
pipenv run python3 --version
echo Pipenv
pipenv --version
echo Pgbench
$PG_BIN/pgbench --version
# FIXME cluster setup is skipped due to various changes in console API
# for now a pre-created cluster is used. When the API gains some stability
# after the massive changes, dynamic cluster setup will be revived.
# So use the pre-created cluster. It needs to be started manually, but stop is automatic after 5 minutes of inactivity
- name: Setup cluster
env:
BENCHMARK_CONSOLE_USER_PASSWORD: "${{ secrets.BENCHMARK_CONSOLE_USER_PASSWORD }}"
BENCHMARK_CONSOLE_ACCESS_TOKEN: "${{ secrets.BENCHMARK_CONSOLE_ACCESS_TOKEN }}"
BENCHMARK_CLUSTER_ID: "${{ secrets.BENCHMARK_CLUSTER_ID }}"
shell: bash
run: |
set -e
echo "Starting cluster"
CLUSTER=$(curl -s --fail --show-error -X POST $BASE_URL/api/v1/clusters/$BENCHMARK_CLUSTER_ID/start \
-H "Authorization: Bearer $BENCHMARK_CONSOLE_ACCESS_TOKEN")
echo $CLUSTER | python -m json.tool
echo "Waiting for cluster to become ready"
sleep 10
echo "CLUSTER_ID=$BENCHMARK_CLUSTER_ID" >> $GITHUB_ENV
CLUSTER=$(curl -s --fail --show-error -X GET $BASE_URL/api/v1/clusters/$BENCHMARK_CLUSTER_ID.json \
-H "Authorization: Bearer $BENCHMARK_CONSOLE_ACCESS_TOKEN")
echo $CLUSTER | python -m json.tool
- name: Run benchmark
# pgbench is installed system wide from official repo
# https://download.postgresql.org/pub/repos/yum/13/redhat/rhel-7-x86_64/
# via
# sudo tee /etc/yum.repos.d/pgdg.repo<<EOF
# [pgdg13]
# name=PostgreSQL 13 for RHEL/CentOS 7 - x86_64
# baseurl=https://download.postgresql.org/pub/repos/yum/13/redhat/rhel-7-x86_64/
# enabled=1
# gpgcheck=0
# EOF
# sudo yum makecache
# sudo yum install postgresql13-contrib
# actual binaries are located in /usr/pgsql-13/bin/
env:
TEST_PG_BENCH_TRANSACTIONS_MATRIX: "5000,10000,20000"
TEST_PG_BENCH_SCALES_MATRIX: "10,15"
PLATFORM: "zenith-staging"
BENCHMARK_CONSOLE_ACCESS_TOKEN: "${{ secrets.BENCHMARK_CONSOLE_ACCESS_TOKEN }}"
BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}"
REMOTE_ENV: "1" # indicate to test harness that we do not have zenith binaries locally
run: |
mkdir -p perf-report-staging
pipenv run pytest test_runner/performance/ -v -m "remote_cluster" --skip-interfering-proc-check --out-dir perf-report-staging
- name: Submit result
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
run: |
REPORT_FROM=$(realpath perf-report-staging) REPORT_TO=staging scripts/generate_and_push_perf_report.sh

4
.gitignore vendored

@@ -7,3 +7,7 @@ test_output/
.vscode
/.zenith
/integration_tests/.zenith
# Coverage
*.profraw
*.profdata

10
.yapfignore Normal file

@@ -0,0 +1,10 @@
# This file is only read when `yapf` is run from this directory.
# Hence we only list top-level directories here to avoid confusion.
# See source code for the exact file format: https://github.com/google/yapf/blob/c6077954245bc3add82dafd853a1c7305a6ebd20/yapf/yapflib/file_resources.py#L40-L43
vendor/
target/
tmp_install/
__pycache__/
test_output/
.zenith/
.git/

785
Cargo.lock generated

File diff suppressed because it is too large


@@ -10,40 +10,26 @@ FROM zenithdb/build:buster AS pg-build
WORKDIR /zenith
COPY ./vendor/postgres vendor/postgres
COPY ./Makefile Makefile
ENV BUILD_TYPE release
RUN make -j $(getconf _NPROCESSORS_ONLN) -s postgres
#
# Calculate cargo dependencies.
# This will always run, but only generate recipe.json with list of dependencies without
# installing them.
#
FROM zenithdb/build:buster AS cargo-deps-inspect
WORKDIR /zenith
COPY . .
RUN cargo chef prepare --recipe-path /zenith/recipe.json
#
# Build cargo dependencies.
# This temp container should be rebuilt only if recipe.json was changed.
#
FROM zenithdb/build:buster AS deps-build
WORKDIR /zenith
COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY --from=cargo-deps-inspect /usr/local/cargo/bin/cargo-chef /usr/local/cargo/bin/
COPY --from=cargo-deps-inspect /zenith/recipe.json recipe.json
RUN ROCKSDB_LIB_DIR=/usr/lib/ cargo chef cook --release --recipe-path recipe.json
RUN rm -rf postgres_install/build
#
# Build zenith binaries
#
# TODO: build cargo deps as separate layer. We used cargo-chef before but that was
# net time waste in a lot of cases. Copying Cargo.lock with empty lib.rs should do the work.
#
FROM zenithdb/build:buster AS build
ARG GIT_VERSION
RUN if [ -z "$GIT_VERSION" ]; then echo "GIT_VERSION is required, use build_arg to pass it"; exit 1; fi
WORKDIR /zenith
COPY . .
# Copy cached dependencies
COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
COPY --from=deps-build /zenith/target target
COPY --from=deps-build /usr/local/cargo/ /usr/local/cargo/
RUN cargo build --release
COPY . .
RUN GIT_VERSION=$GIT_VERSION cargo build --release
#
# Copy binaries to resulting image.
@@ -51,11 +37,11 @@ RUN cargo build --release
FROM debian:buster-slim
WORKDIR /data
RUN apt-get update && apt-get -yq install librocksdb-dev libseccomp-dev openssl && \
RUN apt-get update && apt-get -yq install libreadline-dev libseccomp-dev openssl ca-certificates && \
mkdir zenith_install
COPY --from=build /zenith/target/release/pageserver /usr/local/bin
COPY --from=build /zenith/target/release/wal_acceptor /usr/local/bin
COPY --from=build /zenith/target/release/safekeeper /usr/local/bin
COPY --from=build /zenith/target/release/proxy /usr/local/bin
COPY --from=pg-build /zenith/tmp_install postgres_install
COPY docker-entrypoint.sh /docker-entrypoint.sh


@@ -81,7 +81,7 @@ FROM alpine:3.13
RUN apk add --update openssl build-base libseccomp-dev
RUN apk --no-cache --update --repository https://dl-cdn.alpinelinux.org/alpine/edge/testing add rocksdb
COPY --from=build /zenith/target/release/pageserver /usr/local/bin
COPY --from=build /zenith/target/release/wal_acceptor /usr/local/bin
COPY --from=build /zenith/target/release/safekeeper /usr/local/bin
COPY --from=build /zenith/target/release/proxy /usr/local/bin
COPY --from=pg-build /zenith/tmp_install /usr/local
COPY docker-entrypoint.sh /docker-entrypoint.sh


@@ -9,7 +9,7 @@ WORKDIR /zenith
# Install postgres and zenith build dependencies
# clang is for rocksdb
RUN apt-get update && apt-get -yq install automake libtool build-essential bison flex libreadline-dev zlib1g-dev libxml2-dev \
libseccomp-dev pkg-config libssl-dev librocksdb-dev clang
libseccomp-dev pkg-config libssl-dev clang
# Install rust tools
RUN rustup component add clippy && cargo install cargo-chef cargo-audit
RUN rustup component add clippy && cargo install cargo-audit


@@ -6,34 +6,55 @@ else
SECCOMP =
endif
#
# We differentiate between release / debug build types using the BUILD_TYPE
# environment variable.
#
BUILD_TYPE ?= debug
ifeq ($(BUILD_TYPE),release)
PG_CONFIGURE_OPTS = --enable-debug
PG_CFLAGS = -O2 -g3 $(CFLAGS)
# Unfortunately, `--profile=...` is a nightly feature
CARGO_BUILD_FLAGS += --release
else ifeq ($(BUILD_TYPE),debug)
PG_CONFIGURE_OPTS = --enable-debug --enable-cassert --enable-depend
PG_CFLAGS = -O0 -g3 $(CFLAGS)
else
$(error Bad build type `$(BUILD_TYPE)', see Makefile for options)
endif
# Choose whether we should be silent or verbose
CARGO_BUILD_FLAGS += --$(if $(filter s,$(MAKEFLAGS)),quiet,verbose)
# Fix for a corner case when make doesn't pass a jobserver
CARGO_BUILD_FLAGS += $(filter -j1,$(MAKEFLAGS))
# This option has a side effect of passing make jobserver to cargo.
# However, we shouldn't do this if `make -n` (--dry-run) has been asked.
CARGO_CMD_PREFIX += $(if $(filter n,$(MAKEFLAGS)),,+)
# Force cargo not to print progress bar
CARGO_CMD_PREFIX += CARGO_TERM_PROGRESS_WHEN=never CI=1
#
# Top level Makefile to build Zenith and PostgreSQL
#
.PHONY: all
all: zenith postgres
# We don't want to run 'cargo build' in parallel with the postgres build,
# because interleaving cargo build output with postgres build output looks
# confusing. Also, 'cargo build' is parallel on its own, so it would be too
# much parallelism. (Recursive invocation of postgres target still gets any
# '-j' flag from the command line, so 'make -j' is still useful.)
.NOTPARALLEL:
### Zenith Rust bits
#
# The 'postgres_ffi' depends on the Postgres headers.
.PHONY: zenith
zenith: postgres-headers
cargo build
+@echo "Compiling Zenith"
$(CARGO_CMD_PREFIX) cargo build $(CARGO_BUILD_FLAGS)
### PostgreSQL parts
tmp_install/build/config.status:
+@echo "Configuring postgres build"
mkdir -p tmp_install/build
(cd tmp_install/build && \
../../vendor/postgres/configure CFLAGS='-O0 -g3 $(CFLAGS)' \
--enable-cassert \
--enable-debug \
--enable-depend \
../../vendor/postgres/configure CFLAGS='$(PG_CFLAGS)' \
$(PG_CONFIGURE_OPTS) \
$(SECCOMP) \
--prefix=$(abspath tmp_install) > configure.log)
@@ -47,10 +68,10 @@ postgres-headers: postgres-configure
+@echo "Installing PostgreSQL headers"
$(MAKE) -C tmp_install/build/src/include MAKELEVEL=0 install
# Compile and install PostgreSQL and contrib/zenith
.PHONY: postgres
postgres: postgres-configure
postgres: postgres-configure \
postgres-headers # to prevent `make install` conflicts with zenith's `postgres-headers`
+@echo "Compiling PostgreSQL"
$(MAKE) -C tmp_install/build MAKELEVEL=0 install
+@echo "Compiling contrib/zenith"
@@ -58,18 +79,21 @@ postgres: postgres-configure
+@echo "Compiling contrib/zenith_test_utils"
$(MAKE) -C tmp_install/build/contrib/zenith_test_utils install
.PHONY: postgres-clean
postgres-clean:
$(MAKE) -C tmp_install/build MAKELEVEL=0 clean
# This doesn't remove the effects of 'configure'.
.PHONY: clean
clean:
cd tmp_install/build && ${MAKE} clean
cargo clean
cd tmp_install/build && $(MAKE) clean
$(CARGO_CMD_PREFIX) cargo clean
# This removes everything
.PHONY: distclean
distclean:
rm -rf tmp_install
cargo clean
$(CARGO_CMD_PREFIX) cargo clean
.PHONY: fmt
fmt:


@@ -1 +0,0 @@
./test_runner/Pipfile

30
Pipfile Normal file

@@ -0,0 +1,30 @@
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"
[packages]
pytest = ">=6.0.0"
typing-extensions = "*"
pyjwt = {extras = ["crypto"], version = "*"}
requests = "*"
pytest-xdist = "*"
asyncpg = "*"
cached-property = "*"
psycopg2-binary = "*"
jinja2 = "*"
[dev-packages]
# Behavior may change slightly between versions. These are run continuously,
# so we pin exact versions to avoid surprising breaks. Update if comfortable.
yapf = "==0.31.0"
mypy = "==0.910"
# Non-pinned packages follow.
pipenv = "*"
flake8 = "*"
types-requests = "*"
types-psycopg2 = "*"
[requires]
# we need at least 3.7, but pipenv doesn't allow to say this directly
python_version = "3"

1
Pipfile.lock generated

@@ -1 +0,0 @@
./test_runner/Pipfile.lock

652
Pipfile.lock generated Normal file

@@ -0,0 +1,652 @@
{
"_meta": {
"hash": {
"sha256": "c309cb963a7b07ae3d30e9cbf08b495f77bdecc0e5356fc89d133c4fbcb65b2b"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.python.org/simple",
"verify_ssl": true
}
]
},
"default": {
"asyncpg": {
"hashes": [
"sha256:129d501f3d30616afd51eb8d3142ef51ba05374256bd5834cec3ef4956a9b317",
"sha256:29ef6ae0a617fc13cc2ac5dc8e9b367bb83cba220614b437af9b67766f4b6b20",
"sha256:41704c561d354bef01353835a7846e5606faabbeb846214dfcf666cf53319f18",
"sha256:556b0e92e2b75dc028b3c4bc9bd5162ddf0053b856437cf1f04c97f9c6837d03",
"sha256:8ff5073d4b654e34bd5eaadc01dc4d68b8a9609084d835acd364cd934190a08d",
"sha256:a458fc69051fbb67d995fdda46d75a012b5d6200f91e17d23d4751482640ed4c",
"sha256:a7095890c96ba36f9f668eb552bb020dddb44f8e73e932f8573efc613ee83843",
"sha256:a738f4807c853623d3f93f0fea11f61be6b0e5ca16ea8aeb42c2c7ee742aa853",
"sha256:c4fc0205fe4ddd5aeb3dfdc0f7bafd43411181e1f5650189608e5971cceacff1",
"sha256:dd2fa063c3344823487d9ddccb40802f02622ddf8bf8a6cc53885ee7a2c1c0c6",
"sha256:ddffcb85227bf39cd1bedd4603e0082b243cf3b14ced64dce506a15b05232b83",
"sha256:e36c6806883786b19551bb70a4882561f31135dc8105a59662e0376cf5b2cbc5",
"sha256:eed43abc6ccf1dc02e0d0efc06ce46a411362f3358847c6b0ec9a43426f91ece"
],
"index": "pypi",
"version": "==0.24.0"
},
"attrs": {
"hashes": [
"sha256:149e90d6d8ac20db7a955ad60cf0e6881a3f20d37096140088356da6c716b0b1",
"sha256:ef6aaac3ca6cd92904cdd0d83f629a15f18053ec84e6432106f7a4d04ae4f5fb"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==21.2.0"
},
"cached-property": {
"hashes": [
"sha256:9fa5755838eecbb2d234c3aa390bd80fbd3ac6b6869109bfc1b499f7bd89a130",
"sha256:df4f613cf7ad9a588cc381aaf4a512d26265ecebd5eb9e1ba12f1319eb85a6a0"
],
"index": "pypi",
"version": "==1.5.2"
},
"certifi": {
"hashes": [
"sha256:78884e7c1d4b00ce3cea67b44566851c4343c120abd683433ce934a68ea58872",
"sha256:d62a0163eb4c2344ac042ab2bdf75399a71a2d8c7d47eac2e2ee91b9d6339569"
],
"version": "==2021.10.8"
},
"cffi": {
"hashes": [
"sha256:00c878c90cb53ccfaae6b8bc18ad05d2036553e6d9d1d9dbcf323bbe83854ca3",
"sha256:0104fb5ae2391d46a4cb082abdd5c69ea4eab79d8d44eaaf79f1b1fd806ee4c2",
"sha256:06c48159c1abed75c2e721b1715c379fa3200c7784271b3c46df01383b593636",
"sha256:0808014eb713677ec1292301ea4c81ad277b6cdf2fdd90fd540af98c0b101d20",
"sha256:10dffb601ccfb65262a27233ac273d552ddc4d8ae1bf93b21c94b8511bffe728",
"sha256:14cd121ea63ecdae71efa69c15c5543a4b5fbcd0bbe2aad864baca0063cecf27",
"sha256:17771976e82e9f94976180f76468546834d22a7cc404b17c22df2a2c81db0c66",
"sha256:181dee03b1170ff1969489acf1c26533710231c58f95534e3edac87fff06c443",
"sha256:23cfe892bd5dd8941608f93348c0737e369e51c100d03718f108bf1add7bd6d0",
"sha256:263cc3d821c4ab2213cbe8cd8b355a7f72a8324577dc865ef98487c1aeee2bc7",
"sha256:2756c88cbb94231c7a147402476be2c4df2f6078099a6f4a480d239a8817ae39",
"sha256:27c219baf94952ae9d50ec19651a687b826792055353d07648a5695413e0c605",
"sha256:2a23af14f408d53d5e6cd4e3d9a24ff9e05906ad574822a10563efcef137979a",
"sha256:31fb708d9d7c3f49a60f04cf5b119aeefe5644daba1cd2a0fe389b674fd1de37",
"sha256:3415c89f9204ee60cd09b235810be700e993e343a408693e80ce7f6a40108029",
"sha256:3773c4d81e6e818df2efbc7dd77325ca0dcb688116050fb2b3011218eda36139",
"sha256:3b96a311ac60a3f6be21d2572e46ce67f09abcf4d09344c49274eb9e0bf345fc",
"sha256:3f7d084648d77af029acb79a0ff49a0ad7e9d09057a9bf46596dac9514dc07df",
"sha256:41d45de54cd277a7878919867c0f08b0cf817605e4eb94093e7516505d3c8d14",
"sha256:4238e6dab5d6a8ba812de994bbb0a79bddbdf80994e4ce802b6f6f3142fcc880",
"sha256:45db3a33139e9c8f7c09234b5784a5e33d31fd6907800b316decad50af323ff2",
"sha256:45e8636704eacc432a206ac7345a5d3d2c62d95a507ec70d62f23cd91770482a",
"sha256:4958391dbd6249d7ad855b9ca88fae690783a6be9e86df65865058ed81fc860e",
"sha256:4a306fa632e8f0928956a41fa8e1d6243c71e7eb59ffbd165fc0b41e316b2474",
"sha256:57e9ac9ccc3101fac9d6014fba037473e4358ef4e89f8e181f8951a2c0162024",
"sha256:59888172256cac5629e60e72e86598027aca6bf01fa2465bdb676d37636573e8",
"sha256:5e069f72d497312b24fcc02073d70cb989045d1c91cbd53979366077959933e0",
"sha256:64d4ec9f448dfe041705426000cc13e34e6e5bb13736e9fd62e34a0b0c41566e",
"sha256:6dc2737a3674b3e344847c8686cf29e500584ccad76204efea14f451d4cc669a",
"sha256:74fdfdbfdc48d3f47148976f49fab3251e550a8720bebc99bf1483f5bfb5db3e",
"sha256:75e4024375654472cc27e91cbe9eaa08567f7fbdf822638be2814ce059f58032",
"sha256:786902fb9ba7433aae840e0ed609f45c7bcd4e225ebb9c753aa39725bb3e6ad6",
"sha256:8b6c2ea03845c9f501ed1313e78de148cd3f6cad741a75d43a29b43da27f2e1e",
"sha256:91d77d2a782be4274da750752bb1650a97bfd8f291022b379bb8e01c66b4e96b",
"sha256:91ec59c33514b7c7559a6acda53bbfe1b283949c34fe7440bcf917f96ac0723e",
"sha256:920f0d66a896c2d99f0adbb391f990a84091179542c205fa53ce5787aff87954",
"sha256:a5263e363c27b653a90078143adb3d076c1a748ec9ecc78ea2fb916f9b861962",
"sha256:abb9a20a72ac4e0fdb50dae135ba5e77880518e742077ced47eb1499e29a443c",
"sha256:c2051981a968d7de9dd2d7b87bcb9c939c74a34626a6e2f8181455dd49ed69e4",
"sha256:c21c9e3896c23007803a875460fb786118f0cdd4434359577ea25eb556e34c55",
"sha256:c2502a1a03b6312837279c8c1bd3ebedf6c12c4228ddbad40912d671ccc8a962",
"sha256:d4d692a89c5cf08a8557fdeb329b82e7bf609aadfaed6c0d79f5a449a3c7c023",
"sha256:da5db4e883f1ce37f55c667e5c0de439df76ac4cb55964655906306918e7363c",
"sha256:e7022a66d9b55e93e1a845d8c9eba2a1bebd4966cd8bfc25d9cd07d515b33fa6",
"sha256:ef1f279350da2c586a69d32fc8733092fd32cc8ac95139a00377841f59a3f8d8",
"sha256:f54a64f8b0c8ff0b64d18aa76675262e1700f3995182267998c31ae974fbc382",
"sha256:f5c7150ad32ba43a07c4479f40241756145a1f03b43480e058cfd862bf5041c7",
"sha256:f6f824dc3bce0edab5f427efcfb1d63ee75b6fcb7282900ccaf925be84efb0fc",
"sha256:fd8a250edc26254fe5b33be00402e6d287f562b6a5b2152dec302fa15bb3e997",
"sha256:ffaa5c925128e29efbde7301d8ecaf35c8c60ffbcd6a1ffd3a552177c8e5e796"
],
"version": "==1.15.0"
},
"charset-normalizer": {
"hashes": [
"sha256:e019de665e2bcf9c2b64e2e5aa025fa991da8720daa3c1138cadd2fd1856aed0",
"sha256:f7af805c321bfa1ce6714c51f254e0d5bb5e5834039bc17db7ebe3a4cec9492b"
],
"markers": "python_version >= '3'",
"version": "==2.0.7"
},
"cryptography": {
"hashes": [
"sha256:07bb7fbfb5de0980590ddfc7f13081520def06dc9ed214000ad4372fb4e3c7f6",
"sha256:18d90f4711bf63e2fb21e8c8e51ed8189438e6b35a6d996201ebd98a26abbbe6",
"sha256:1ed82abf16df40a60942a8c211251ae72858b25b7421ce2497c2eb7a1cee817c",
"sha256:22a38e96118a4ce3b97509443feace1d1011d0571fae81fc3ad35f25ba3ea999",
"sha256:2d69645f535f4b2c722cfb07a8eab916265545b3475fdb34e0be2f4ee8b0b15e",
"sha256:4a2d0e0acc20ede0f06ef7aa58546eee96d2592c00f450c9acb89c5879b61992",
"sha256:54b2605e5475944e2213258e0ab8696f4f357a31371e538ef21e8d61c843c28d",
"sha256:7075b304cd567694dc692ffc9747f3e9cb393cc4aa4fb7b9f3abd6f5c4e43588",
"sha256:7b7ceeff114c31f285528ba8b390d3e9cfa2da17b56f11d366769a807f17cbaa",
"sha256:7eba2cebca600a7806b893cb1d541a6e910afa87e97acf2021a22b32da1df52d",
"sha256:928185a6d1ccdb816e883f56ebe92e975a262d31cc536429041921f8cb5a62fd",
"sha256:9933f28f70d0517686bd7de36166dda42094eac49415459d9bdf5e7df3e0086d",
"sha256:a688ebcd08250eab5bb5bca318cc05a8c66de5e4171a65ca51db6bd753ff8953",
"sha256:abb5a361d2585bb95012a19ed9b2c8f412c5d723a9836418fab7aaa0243e67d2",
"sha256:c10c797ac89c746e488d2ee92bd4abd593615694ee17b2500578b63cad6b93a8",
"sha256:ced40344e811d6abba00295ced98c01aecf0c2de39481792d87af4fa58b7b4d6",
"sha256:d57e0cdc1b44b6cdf8af1d01807db06886f10177469312fbde8f44ccbb284bc9",
"sha256:d99915d6ab265c22873f1b4d6ea5ef462ef797b4140be4c9d8b179915e0985c6",
"sha256:eb80e8a1f91e4b7ef8b33041591e6d89b2b8e122d787e87eeb2b08da71bb16ad",
"sha256:ebeddd119f526bcf323a89f853afb12e225902a24d29b55fe18dd6fcb2838a76"
],
"version": "==35.0.0"
},
"execnet": {
"hashes": [
"sha256:8f694f3ba9cc92cab508b152dcfe322153975c29bda272e2fd7f3f00f36e47c5",
"sha256:a295f7cc774947aac58dde7fdc85f4aa00c42adf5d8f5468fc630c1acf30a142"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==1.9.0"
},
"idna": {
"hashes": [
"sha256:84d9dd047ffa80596e0f246e2eab0b391788b0503584e8945f2368256d2735ff",
"sha256:9d643ff0a55b762d5cdb124b8eaa99c66322e2157b69160bc32796e824360e6d"
],
"markers": "python_version >= '3'",
"version": "==3.3"
},
"importlib-metadata": {
"hashes": [
"sha256:b618b6d2d5ffa2f16add5697cf57a46c76a56229b0ed1c438322e4e95645bd15",
"sha256:f284b3e11256ad1e5d03ab86bb2ccd6f5339688ff17a4d797a0fe7df326f23b1"
],
"markers": "python_version < '3.8'",
"version": "==4.8.1"
},
"iniconfig": {
"hashes": [
"sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3",
"sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"
],
"version": "==1.1.1"
},
"jinja2": {
"hashes": [
"sha256:827a0e32839ab1600d4eb1c4c33ec5a8edfbc5cb42dafa13b81f182f97784b45",
"sha256:8569982d3f0889eed11dd620c706d39b60c36d6d25843961f33f77fb6bc6b20c"
],
"index": "pypi",
"version": "==3.0.2"
},
"markupsafe": {
"hashes": [
"sha256:01a9b8ea66f1658938f65b93a85ebe8bc016e6769611be228d797c9d998dd298",
"sha256:023cb26ec21ece8dc3907c0e8320058b2e0cb3c55cf9564da612bc325bed5e64",
"sha256:0446679737af14f45767963a1a9ef7620189912317d095f2d9ffa183a4d25d2b",
"sha256:04635854b943835a6ea959e948d19dcd311762c5c0c6e1f0e16ee57022669194",
"sha256:0717a7390a68be14b8c793ba258e075c6f4ca819f15edfc2a3a027c823718567",
"sha256:0955295dd5eec6cb6cc2fe1698f4c6d84af2e92de33fbcac4111913cd100a6ff",
"sha256:0d4b31cc67ab36e3392bbf3862cfbadac3db12bdd8b02a2731f509ed5b829724",
"sha256:10f82115e21dc0dfec9ab5c0223652f7197feb168c940f3ef61563fc2d6beb74",
"sha256:168cd0a3642de83558a5153c8bd34f175a9a6e7f6dc6384b9655d2697312a646",
"sha256:1d609f577dc6e1aa17d746f8bd3c31aa4d258f4070d61b2aa5c4166c1539de35",
"sha256:1f2ade76b9903f39aa442b4aadd2177decb66525062db244b35d71d0ee8599b6",
"sha256:20dca64a3ef2d6e4d5d615a3fd418ad3bde77a47ec8a23d984a12b5b4c74491a",
"sha256:2a7d351cbd8cfeb19ca00de495e224dea7e7d919659c2841bbb7f420ad03e2d6",
"sha256:2d7d807855b419fc2ed3e631034685db6079889a1f01d5d9dac950f764da3dad",
"sha256:2ef54abee730b502252bcdf31b10dacb0a416229b72c18b19e24a4509f273d26",
"sha256:36bc903cbb393720fad60fc28c10de6acf10dc6cc883f3e24ee4012371399a38",
"sha256:37205cac2a79194e3750b0af2a5720d95f786a55ce7df90c3af697bfa100eaac",
"sha256:3c112550557578c26af18a1ccc9e090bfe03832ae994343cfdacd287db6a6ae7",
"sha256:3dd007d54ee88b46be476e293f48c85048603f5f516008bee124ddd891398ed6",
"sha256:4296f2b1ce8c86a6aea78613c34bb1a672ea0e3de9c6ba08a960efe0b0a09047",
"sha256:47ab1e7b91c098ab893b828deafa1203de86d0bc6ab587b160f78fe6c4011f75",
"sha256:49e3ceeabbfb9d66c3aef5af3a60cc43b85c33df25ce03d0031a608b0a8b2e3f",
"sha256:4dc8f9fb58f7364b63fd9f85013b780ef83c11857ae79f2feda41e270468dd9b",
"sha256:4efca8f86c54b22348a5467704e3fec767b2db12fc39c6d963168ab1d3fc9135",
"sha256:53edb4da6925ad13c07b6d26c2a852bd81e364f95301c66e930ab2aef5b5ddd8",
"sha256:5855f8438a7d1d458206a2466bf82b0f104a3724bf96a1c781ab731e4201731a",
"sha256:594c67807fb16238b30c44bdf74f36c02cdf22d1c8cda91ef8a0ed8dabf5620a",
"sha256:5b6d930f030f8ed98e3e6c98ffa0652bdb82601e7a016ec2ab5d7ff23baa78d1",
"sha256:5bb28c636d87e840583ee3adeb78172efc47c8b26127267f54a9c0ec251d41a9",
"sha256:60bf42e36abfaf9aff1f50f52644b336d4f0a3fd6d8a60ca0d054ac9f713a864",
"sha256:611d1ad9a4288cf3e3c16014564df047fe08410e628f89805e475368bd304914",
"sha256:6300b8454aa6930a24b9618fbb54b5a68135092bc666f7b06901f897fa5c2fee",
"sha256:63f3268ba69ace99cab4e3e3b5840b03340efed0948ab8f78d2fd87ee5442a4f",
"sha256:6557b31b5e2c9ddf0de32a691f2312a32f77cd7681d8af66c2692efdbef84c18",
"sha256:693ce3f9e70a6cf7d2fb9e6c9d8b204b6b39897a2c4a1aa65728d5ac97dcc1d8",
"sha256:6a7fae0dd14cf60ad5ff42baa2e95727c3d81ded453457771d02b7d2b3f9c0c2",
"sha256:6c4ca60fa24e85fe25b912b01e62cb969d69a23a5d5867682dd3e80b5b02581d",
"sha256:6fcf051089389abe060c9cd7caa212c707e58153afa2c649f00346ce6d260f1b",
"sha256:7d91275b0245b1da4d4cfa07e0faedd5b0812efc15b702576d103293e252af1b",
"sha256:89c687013cb1cd489a0f0ac24febe8c7a666e6e221b783e53ac50ebf68e45d86",
"sha256:8d206346619592c6200148b01a2142798c989edcb9c896f9ac9722a99d4e77e6",
"sha256:905fec760bd2fa1388bb5b489ee8ee5f7291d692638ea5f67982d968366bef9f",
"sha256:97383d78eb34da7e1fa37dd273c20ad4320929af65d156e35a5e2d89566d9dfb",
"sha256:984d76483eb32f1bcb536dc27e4ad56bba4baa70be32fa87152832cdd9db0833",
"sha256:99df47edb6bda1249d3e80fdabb1dab8c08ef3975f69aed437cb69d0a5de1e28",
"sha256:9f02365d4e99430a12647f09b6cc8bab61a6564363f313126f775eb4f6ef798e",
"sha256:a30e67a65b53ea0a5e62fe23682cfe22712e01f453b95233b25502f7c61cb415",
"sha256:ab3ef638ace319fa26553db0624c4699e31a28bb2a835c5faca8f8acf6a5a902",
"sha256:aca6377c0cb8a8253e493c6b451565ac77e98c2951c45f913e0b52facdcff83f",
"sha256:add36cb2dbb8b736611303cd3bfcee00afd96471b09cda130da3581cbdc56a6d",
"sha256:b2f4bf27480f5e5e8ce285a8c8fd176c0b03e93dcc6646477d4630e83440c6a9",
"sha256:b7f2d075102dc8c794cbde1947378051c4e5180d52d276987b8d28a3bd58c17d",
"sha256:baa1a4e8f868845af802979fcdbf0bb11f94f1cb7ced4c4b8a351bb60d108145",
"sha256:be98f628055368795d818ebf93da628541e10b75b41c559fdf36d104c5787066",
"sha256:bf5d821ffabf0ef3533c39c518f3357b171a1651c1ff6827325e4489b0e46c3c",
"sha256:c47adbc92fc1bb2b3274c4b3a43ae0e4573d9fbff4f54cd484555edbf030baf1",
"sha256:cdfba22ea2f0029c9261a4bd07e830a8da012291fbe44dc794e488b6c9bb353a",
"sha256:d6c7ebd4e944c85e2c3421e612a7057a2f48d478d79e61800d81468a8d842207",
"sha256:d7f9850398e85aba693bb640262d3611788b1f29a79f0c93c565694658f4071f",
"sha256:d8446c54dc28c01e5a2dbac5a25f071f6653e6e40f3a8818e8b45d790fe6ef53",
"sha256:deb993cacb280823246a026e3b2d81c493c53de6acfd5e6bfe31ab3402bb37dd",
"sha256:e0f138900af21926a02425cf736db95be9f4af72ba1bb21453432a07f6082134",
"sha256:e9936f0b261d4df76ad22f8fee3ae83b60d7c3e871292cd42f40b81b70afae85",
"sha256:f0567c4dc99f264f49fe27da5f735f414c4e7e7dd850cfd8e69f0862d7c74ea9",
"sha256:f5653a225f31e113b152e56f154ccbe59eeb1c7487b39b9d9f9cdb58e6c79dc5",
"sha256:f826e31d18b516f653fe296d967d700fddad5901ae07c622bb3705955e1faa94",
"sha256:f8ba0e8349a38d3001fae7eadded3f6606f0da5d748ee53cc1dab1d6527b9509",
"sha256:f9081981fe268bd86831e5c75f7de206ef275defcb82bc70740ae6dc507aee51",
"sha256:fa130dd50c57d53368c9d59395cb5526eda596d3ffe36666cd81a44d56e48872"
],
"markers": "python_version >= '3.6'",
"version": "==2.0.1"
},
"packaging": {
"hashes": [
"sha256:096d689d78ca690e4cd8a89568ba06d07ca097e3306a4381635073ca91479966",
"sha256:14317396d1e8cdb122989b916fa2c7e9ca8e2be9e8060a6eff75b6b7b4d8a7e0"
],
"markers": "python_version >= '3.6'",
"version": "==21.2"
},
"pluggy": {
"hashes": [
"sha256:4224373bacce55f955a878bf9cfa763c1e360858e330072059e10bad68531159",
"sha256:74134bbf457f031a36d68416e1509f34bd5ccc019f0bcc952c7b909d06b37bd3"
],
"markers": "python_version >= '3.6'",
"version": "==1.0.0"
},
"psycopg2-binary": {
"hashes": [
"sha256:0b7dae87f0b729922e06f85f667de7bf16455d411971b2043bbd9577af9d1975",
"sha256:0f2e04bd2a2ab54fa44ee67fe2d002bb90cee1c0f1cc0ebc3148af7b02034cbd",
"sha256:123c3fb684e9abfc47218d3784c7b4c47c8587951ea4dd5bc38b6636ac57f616",
"sha256:1473c0215b0613dd938db54a653f68251a45a78b05f6fc21af4326f40e8360a2",
"sha256:14db1752acdd2187d99cb2ca0a1a6dfe57fc65c3281e0f20e597aac8d2a5bd90",
"sha256:1e3a362790edc0a365385b1ac4cc0acc429a0c0d662d829a50b6ce743ae61b5a",
"sha256:1e85b74cbbb3056e3656f1cc4781294df03383127a8114cbc6531e8b8367bf1e",
"sha256:20f1ab44d8c352074e2d7ca67dc00843067788791be373e67a0911998787ce7d",
"sha256:24b0b6688b9f31a911f2361fe818492650795c9e5d3a1bc647acbd7440142a4f",
"sha256:2f62c207d1740b0bde5c4e949f857b044818f734a3d57f1d0d0edc65050532ed",
"sha256:3242b9619de955ab44581a03a64bdd7d5e470cc4183e8fcadd85ab9d3756ce7a",
"sha256:35c4310f8febe41f442d3c65066ca93cccefd75013df3d8c736c5b93ec288140",
"sha256:4235f9d5ddcab0b8dbd723dca56ea2922b485ea00e1dafacf33b0c7e840b3d32",
"sha256:542875f62bc56e91c6eac05a0deadeae20e1730be4c6334d8f04c944fcd99759",
"sha256:5ced67f1e34e1a450cdb48eb53ca73b60aa0af21c46b9b35ac3e581cf9f00e31",
"sha256:661509f51531ec125e52357a489ea3806640d0ca37d9dada461ffc69ee1e7b6e",
"sha256:7360647ea04db2e7dff1648d1da825c8cf68dc5fbd80b8fb5b3ee9f068dcd21a",
"sha256:736b8797b58febabb85494142c627bd182b50d2a7ec65322983e71065ad3034c",
"sha256:8c13d72ed6af7fd2c8acbd95661cf9477f94e381fce0792c04981a8283b52917",
"sha256:988b47ac70d204aed01589ed342303da7c4d84b56c2f4c4b8b00deda123372bf",
"sha256:995fc41ebda5a7a663a254a1dcac52638c3e847f48307b5416ee373da15075d7",
"sha256:a36c7eb6152ba5467fb264d73844877be8b0847874d4822b7cf2d3c0cb8cdcb0",
"sha256:aed4a9a7e3221b3e252c39d0bf794c438dc5453bc2963e8befe9d4cd324dff72",
"sha256:aef9aee84ec78af51107181d02fe8773b100b01c5dfde351184ad9223eab3698",
"sha256:b0221ca5a9837e040ebf61f48899926b5783668b7807419e4adae8175a31f773",
"sha256:b4d7679a08fea64573c969f6994a2631908bb2c0e69a7235648642f3d2e39a68",
"sha256:c250a7ec489b652c892e4f0a5d122cc14c3780f9f643e1a326754aedf82d9a76",
"sha256:ca86db5b561b894f9e5f115d6a159fff2a2570a652e07889d8a383b5fae66eb4",
"sha256:cfc523edecddaef56f6740d7de1ce24a2fdf94fd5e704091856a201872e37f9f",
"sha256:d92272c7c16e105788efe2cfa5d680f07e34e0c29b03c1908f8636f55d5f915a",
"sha256:da113b70f6ec40e7d81b43d1b139b9db6a05727ab8be1ee559f3a69854a69d34",
"sha256:f6fac64a38f6768e7bc7b035b9e10d8a538a9fadce06b983fb3e6fa55ac5f5ce",
"sha256:f8559617b1fcf59a9aedba2c9838b5b6aa211ffedecabca412b92a1ff75aac1a",
"sha256:fbb42a541b1093385a2d8c7eec94d26d30437d0e77c1d25dae1dcc46741a385e"
],
"index": "pypi",
"version": "==2.9.1"
},
"py": {
"hashes": [
"sha256:21b81bda15b66ef5e1a777a21c4dcd9c20ad3efd0b3f817e7a809035269e1bd3",
"sha256:3b80836aa6d1feeaa108e046da6423ab8f6ceda6468545ae8d02d9d58d18818a"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==1.10.0"
},
"pycparser": {
"hashes": [
"sha256:2d475327684562c3a96cc71adf7dc8c4f0565175cf86b6d7a404ff4c771f15f0",
"sha256:7582ad22678f0fcd81102833f60ef8d0e57288b6b5fb00323d101be910e35705"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.20"
},
"pyjwt": {
"extras": [
"crypto"
],
"hashes": [
"sha256:b888b4d56f06f6dcd777210c334e69c737be74755d3e5e9ee3fe67dc18a0ee41",
"sha256:e0c4bb8d9f0af0c7f5b1ec4c5036309617d03d56932877f2f7a0beeb5318322f"
],
"index": "pypi",
"version": "==2.3.0"
},
"pyparsing": {
"hashes": [
"sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1",
"sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.4.7"
},
"pytest": {
"hashes": [
"sha256:131b36680866a76e6781d13f101efb86cf674ebb9762eb70d3082b6f29889e89",
"sha256:7310f8d27bc79ced999e760ca304d69f6ba6c6649c0b60fb0e04a4a77cacc134"
],
"index": "pypi",
"version": "==6.2.5"
},
"pytest-forked": {
"hashes": [
"sha256:6aa9ac7e00ad1a539c41bec6d21011332de671e938c7637378ec9710204e37ca",
"sha256:dc4147784048e70ef5d437951728825a131b81714b398d5d52f17c7c144d8815"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==1.3.0"
},
"pytest-xdist": {
"hashes": [
"sha256:7b61ebb46997a0820a263553179d6d1e25a8c50d8a8620cd1aa1e20e3be99168",
"sha256:89b330316f7fc475f999c81b577c2b926c9569f3d397ae432c0c2e2496d61ff9"
],
"index": "pypi",
"version": "==2.4.0"
},
"requests": {
"hashes": [
"sha256:6c1246513ecd5ecd4528a0906f910e8f0f9c6b8ec72030dc9fd154dc1a6efd24",
"sha256:b8aa58f8cf793ffd8782d3d8cb19e66ef36f7aba4353eec859e74678b01b07a7"
],
"index": "pypi",
"version": "==2.26.0"
},
"toml": {
"hashes": [
"sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b",
"sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.10.2"
},
"typing-extensions": {
"hashes": [
"sha256:49f75d16ff11f1cd258e1b988ccff82a3ca5570217d7ad8c5f48205dd99a677e",
"sha256:d8226d10bc02a29bcc81df19a26e56a9647f8b0a6d4a83924139f4a8b01f17b7",
"sha256:f1d25edafde516b146ecd0613dabcc61409817af4766fbbcfb8d1ad4ec441a34"
],
"index": "pypi",
"version": "==3.10.0.2"
},
"urllib3": {
"hashes": [
"sha256:4987c65554f7a2dbf30c18fd48778ef124af6fab771a377103da0585e2336ece",
"sha256:c4fdf4019605b6e5423637e01bc9fe4daef873709a7973e195ceba0a62bbc844"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4' and python_version < '4'",
"version": "==1.26.7"
},
"zipp": {
"hashes": [
"sha256:71c644c5369f4a6e07636f0aa966270449561fcea2e3d6747b8d23efaa9d7832",
"sha256:9fe5ea21568a0a70e50f273397638d39b03353731e6cbbb3fd8502a33fec40bc"
],
"markers": "python_version >= '3.6'",
"version": "==3.6.0"
}
},
"develop": {
"backports.entry-points-selectable": {
"hashes": [
"sha256:988468260ec1c196dab6ae1149260e2f5472c9110334e5d51adcb77867361f6a",
"sha256:a6d9a871cde5e15b4c4a53e3d43ba890cc6861ec1332c9c2428c92f977192acc"
],
"markers": "python_version >= '2.7'",
"version": "==1.1.0"
},
"certifi": {
"hashes": [
"sha256:78884e7c1d4b00ce3cea67b44566851c4343c120abd683433ce934a68ea58872",
"sha256:d62a0163eb4c2344ac042ab2bdf75399a71a2d8c7d47eac2e2ee91b9d6339569"
],
"version": "==2021.10.8"
},
"distlib": {
"hashes": [
"sha256:c8b54e8454e5bf6237cc84c20e8264c3e991e824ef27e8f1e81049867d861e31",
"sha256:d982d0751ff6eaaab5e2ec8e691d949ee80eddf01a62eaa96ddb11531fe16b05"
],
"version": "==0.3.3"
},
"filelock": {
"hashes": [
"sha256:7afc856f74fa7006a289fd10fa840e1eebd8bbff6bffb69c26c54a0512ea8cf8",
"sha256:bb2a1c717df74c48a2d00ed625e5a66f8572a3a30baacb7657add1d7bac4097b"
],
"markers": "python_version >= '3.6'",
"version": "==3.3.2"
},
"flake8": {
"hashes": [
"sha256:479b1304f72536a55948cb40a32dce8bb0ffe3501e26eaf292c7e60eb5e0428d",
"sha256:806e034dda44114815e23c16ef92f95c91e4c71100ff52813adf7132a6ad870d"
],
"index": "pypi",
"version": "==4.0.1"
},
"importlib-metadata": {
"hashes": [
"sha256:b618b6d2d5ffa2f16add5697cf57a46c76a56229b0ed1c438322e4e95645bd15",
"sha256:f284b3e11256ad1e5d03ab86bb2ccd6f5339688ff17a4d797a0fe7df326f23b1"
],
"markers": "python_version < '3.8'",
"version": "==4.8.1"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"mypy": {
"hashes": [
"sha256:088cd9c7904b4ad80bec811053272986611b84221835e079be5bcad029e79dd9",
"sha256:0aadfb2d3935988ec3815952e44058a3100499f5be5b28c34ac9d79f002a4a9a",
"sha256:119bed3832d961f3a880787bf621634ba042cb8dc850a7429f643508eeac97b9",
"sha256:1a85e280d4d217150ce8cb1a6dddffd14e753a4e0c3cf90baabb32cefa41b59e",
"sha256:3c4b8ca36877fc75339253721f69603a9c7fdb5d4d5a95a1a1b899d8b86a4de2",
"sha256:3e382b29f8e0ccf19a2df2b29a167591245df90c0b5a2542249873b5c1d78212",
"sha256:42c266ced41b65ed40a282c575705325fa7991af370036d3f134518336636f5b",
"sha256:53fd2eb27a8ee2892614370896956af2ff61254c275aaee4c230ae771cadd885",
"sha256:704098302473cb31a218f1775a873b376b30b4c18229421e9e9dc8916fd16150",
"sha256:7df1ead20c81371ccd6091fa3e2878559b5c4d4caadaf1a484cf88d93ca06703",
"sha256:866c41f28cee548475f146aa4d39a51cf3b6a84246969f3759cb3e9c742fc072",
"sha256:a155d80ea6cee511a3694b108c4494a39f42de11ee4e61e72bc424c490e46457",
"sha256:adaeee09bfde366d2c13fe6093a7df5df83c9a2ba98638c7d76b010694db760e",
"sha256:b6fb13123aeef4a3abbcfd7e71773ff3ff1526a7d3dc538f3929a49b42be03f0",
"sha256:b94e4b785e304a04ea0828759172a15add27088520dc7e49ceade7834275bedb",
"sha256:c0df2d30ed496a08de5daed2a9ea807d07c21ae0ab23acf541ab88c24b26ab97",
"sha256:c6c2602dffb74867498f86e6129fd52a2770c48b7cd3ece77ada4fa38f94eba8",
"sha256:ceb6e0a6e27fb364fb3853389607cf7eb3a126ad335790fa1e14ed02fba50811",
"sha256:d9dd839eb0dc1bbe866a288ba3c1afc33a202015d2ad83b31e875b5905a079b6",
"sha256:e4dab234478e3bd3ce83bac4193b2ecd9cf94e720ddd95ce69840273bf44f6de",
"sha256:ec4e0cd079db280b6bdabdc807047ff3e199f334050db5cbb91ba3e959a67504",
"sha256:ecd2c3fe726758037234c93df7e98deb257fd15c24c9180dacf1ef829da5f921",
"sha256:ef565033fa5a958e62796867b1df10c40263ea9ded87164d67572834e57a174d"
],
"index": "pypi",
"version": "==0.910"
},
"mypy-extensions": {
"hashes": [
"sha256:090fedd75945a69ae91ce1303b5824f428daf5a028d2f6ab8a299250a846f15d",
"sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"
],
"version": "==0.4.3"
},
"pipenv": {
"hashes": [
"sha256:05958fadcd70b2de6a27542fcd2bd72dd5c59c6d35307fdac3e06361fb06e30e",
"sha256:d180f5be4775c552fd5e69ae18a9d6099d9dafb462efe54f11c72cb5f4d5e977"
],
"index": "pypi",
"version": "==2021.5.29"
},
"platformdirs": {
"hashes": [
"sha256:367a5e80b3d04d2428ffa76d33f124cf11e8fff2acdaa9b43d545f5c7d661ef2",
"sha256:8868bbe3c3c80d42f20156f22e7131d2fb321f5bc86a2a345375c6481a67021d"
],
"markers": "python_version >= '3.6'",
"version": "==2.4.0"
},
"pycodestyle": {
"hashes": [
"sha256:720f8b39dde8b293825e7ff02c475f3077124006db4f440dcbc9a20b76548a20",
"sha256:eddd5847ef438ea1c7870ca7eb78a9d47ce0cdb4851a5523949f2601d0cbbe7f"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==2.8.0"
},
"pyflakes": {
"hashes": [
"sha256:05a85c2872edf37a4ed30b0cce2f6093e1d0581f8c19d7393122da7e25b2b24c",
"sha256:3bb3a3f256f4b7968c9c788781e4ff07dce46bdf12339dcda61053375426ee2e"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.4.0"
},
"six": {
"hashes": [
"sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926",
"sha256:8abb2f1d86890a2dfb989f9a77cfcfd3e47c2a354b01111771326f8aa26e0254"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==1.16.0"
},
"toml": {
"hashes": [
"sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b",
"sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.10.2"
},
"typed-ast": {
"hashes": [
"sha256:01ae5f73431d21eead5015997ab41afa53aa1fbe252f9da060be5dad2c730ace",
"sha256:067a74454df670dcaa4e59349a2e5c81e567d8d65458d480a5b3dfecec08c5ff",
"sha256:0fb71b8c643187d7492c1f8352f2c15b4c4af3f6338f21681d3681b3dc31a266",
"sha256:1b3ead4a96c9101bef08f9f7d1217c096f31667617b58de957f690c92378b528",
"sha256:2068531575a125b87a41802130fa7e29f26c09a2833fea68d9a40cf33902eba6",
"sha256:209596a4ec71d990d71d5e0d312ac935d86930e6eecff6ccc7007fe54d703808",
"sha256:2c726c276d09fc5c414693a2de063f521052d9ea7c240ce553316f70656c84d4",
"sha256:398e44cd480f4d2b7ee8d98385ca104e35c81525dd98c519acff1b79bdaac363",
"sha256:52b1eb8c83f178ab787f3a4283f68258525f8d70f778a2f6dd54d3b5e5fb4341",
"sha256:5feca99c17af94057417d744607b82dd0a664fd5e4ca98061480fd8b14b18d04",
"sha256:7538e495704e2ccda9b234b82423a4038f324f3a10c43bc088a1636180f11a41",
"sha256:760ad187b1041a154f0e4d0f6aae3e40fdb51d6de16e5c99aedadd9246450e9e",
"sha256:777a26c84bea6cd934422ac2e3b78863a37017618b6e5c08f92ef69853e765d3",
"sha256:95431a26309a21874005845c21118c83991c63ea800dd44843e42a916aec5899",
"sha256:9ad2c92ec681e02baf81fdfa056fe0d818645efa9af1f1cd5fd6f1bd2bdfd805",
"sha256:9c6d1a54552b5330bc657b7ef0eae25d00ba7ffe85d9ea8ae6540d2197a3788c",
"sha256:aee0c1256be6c07bd3e1263ff920c325b59849dc95392a05f258bb9b259cf39c",
"sha256:af3d4a73793725138d6b334d9d247ce7e5f084d96284ed23f22ee626a7b88e39",
"sha256:b36b4f3920103a25e1d5d024d155c504080959582b928e91cb608a65c3a49e1a",
"sha256:b9574c6f03f685070d859e75c7f9eeca02d6933273b5e69572e5ff9d5e3931c3",
"sha256:bff6ad71c81b3bba8fa35f0f1921fb24ff4476235a6e94a26ada2e54370e6da7",
"sha256:c190f0899e9f9f8b6b7863debfb739abcb21a5c054f911ca3596d12b8a4c4c7f",
"sha256:c907f561b1e83e93fad565bac5ba9c22d96a54e7ea0267c708bffe863cbe4075",
"sha256:cae53c389825d3b46fb37538441f75d6aecc4174f615d048321b716df2757fb0",
"sha256:dd4a21253f42b8d2b48410cb31fe501d32f8b9fbeb1f55063ad102fe9c425e40",
"sha256:dde816ca9dac1d9c01dd504ea5967821606f02e510438120091b84e852367428",
"sha256:f2362f3cb0f3172c42938946dbc5b7843c2a28aec307c49100c8b38764eb6927",
"sha256:f328adcfebed9f11301eaedfa48e15bdece9b519fb27e6a8c01aa52a17ec31b3",
"sha256:f8afcf15cc511ada719a88e013cec87c11aff7b91f019295eb4530f96fe5ef2f",
"sha256:fb1bbeac803adea29cedd70781399c99138358c26d05fcbd23c13016b7f5ec65"
],
"markers": "python_version < '3.8'",
"version": "==1.4.3"
},
"types-psycopg2": {
"hashes": [
"sha256:77ed80f2668582654623e04fb3d741ecce93effcc39c929d7e02f4a917a538ce",
"sha256:98a6e0e9580cd7eb4bd4d20f7c7063d154b2589a2b90c0ce4e3ca6085cde77c6"
],
"index": "pypi",
"version": "==2.9.1"
},
"types-requests": {
"hashes": [
"sha256:b279284e51f668e38ee12d9665e4d789089f532dc2a0be4a1508ca0efd98ba9e",
"sha256:ba1d108d512e294b6080c37f6ae7cb2a2abf527560e2b671d1786c1fc46b541a"
],
"index": "pypi",
"version": "==2.25.11"
},
"typing-extensions": {
"hashes": [
"sha256:49f75d16ff11f1cd258e1b988ccff82a3ca5570217d7ad8c5f48205dd99a677e",
"sha256:d8226d10bc02a29bcc81df19a26e56a9647f8b0a6d4a83924139f4a8b01f17b7",
"sha256:f1d25edafde516b146ecd0613dabcc61409817af4766fbbcfb8d1ad4ec441a34"
],
"index": "pypi",
"version": "==3.10.0.2"
},
"virtualenv": {
"hashes": [
"sha256:4b02e52a624336eece99c96e3ab7111f469c24ba226a53ec474e8e787b365814",
"sha256:576d05b46eace16a9c348085f7d0dc8ef28713a2cabaa1cf0aea41e8f12c9218"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==20.10.0"
},
"virtualenv-clone": {
"hashes": [
"sha256:418ee935c36152f8f153c79824bb93eaf6f0f7984bae31d3f48f350b9183501a",
"sha256:44d5263bceed0bac3e1424d64f798095233b64def1c5689afa43dc3223caf5b0"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.5.7"
},
"yapf": {
"hashes": [
"sha256:408fb9a2b254c302f49db83c59f9aa0b4b0fd0ec25be3a5c51181327922ff63d",
"sha256:e3a234ba8455fe201eaa649cdac872d590089a18b661e39bbac7020978dd9c2e"
],
"index": "pypi",
"version": "==0.31.0"
},
"zipp": {
"hashes": [
"sha256:71c644c5369f4a6e07636f0aa966270449561fcea2e3d6747b8d23efaa9d7832",
"sha256:9fe5ea21568a0a70e50f273397638d39b03353731e6cbbb3fd8502a33fec40bc"
],
"markers": "python_version >= '3.6'",
"version": "==3.6.0"
}
}
}


@@ -6,7 +6,7 @@ Zenith substitutes PostgreSQL storage layer and redistributes data across a clus
A Zenith installation consists of Compute nodes and Storage engine.
Compute nodes are stateles PostgreSQL nodes, backed by zenith storage.
Compute nodes are stateless PostgreSQL nodes, backed by zenith storage.
Zenith storage engine consists of two major components:
- Pageserver. Scalable storage backend for compute nodes.
@@ -25,15 +25,15 @@ Pageserver consists of:
On Ubuntu or Debian this set of packages should be sufficient to build the code:
```text
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang
libssl-dev clang pkg-config libpq-dev
```
[Rust] 1.52 or later is also required.
[Rust] 1.55 or later is also required.
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests (not required to use the code), install
Python (3.6 or higher), and install python3 packages with `pipenv` using `pipenv install` in the project directory.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `pipenv install` in the project directory.
2. Build zenith and patched postgres
```sh
@@ -47,17 +47,26 @@ make -j5
# Create repository in .zenith with proper paths to binaries and data
# Later that would be responsibility of a package install script
> ./target/debug/zenith init
initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229
created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8
created main branch
pageserver init succeeded
# start pageserver
# start pageserver and safekeeper
> ./target/debug/zenith start
Starting pageserver at '127.0.0.1:64000' in .zenith
Starting pageserver at 'localhost:64000' in '.zenith'
Pageserver started
initializing for single for 7676
Starting safekeeper at 'localhost:5454' in '.zenith/safekeepers/single'
Safekeeper started
# start postgres on top on the pageserver
# start postgres compute node
> ./target/debug/zenith pg start main
Starting postgres node at 'host=127.0.0.1 port=55432 user=stas'
Starting new postgres main on main...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres'
waiting for server to start.... done
server started
# check list of running postgres instances
> ./target/debug/zenith pg list
@@ -108,13 +117,19 @@ postgres=# insert into t values(2,2);
INSERT 0 1
```
6. If you want to run tests afterwards (see below), you have to stop all the running pageserver, safekeeper and postgres instances
you have just started. You can stop them all with one command:
```sh
> ./target/debug/zenith stop
```
## Running tests
```sh
git clone --recursive https://github.com/zenithdb/zenith.git
make # builds also postgres and installs it to ./tmp_install
cd test_runner
pytest
pipenv run pytest
```
## Documentation
@@ -125,9 +140,19 @@ Now we use README files to cover design ideas and overall architecture for each
To view your `rustdoc` documentation in a browser, try running `cargo doc --no-deps --open`
### Postgres-specific terms
Due to Zenith's very close relation with PostgreSQL internals, there are numerous specific terms used.
The same applies to certain spelling: for example, we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it is inconsistent with what the PostgreSQL code and its documentation use.
To get more familiar with this aspect, refer to:
- [Zenith glossary](/docs/glossary.md)
- [PostgreSQL glossary](https://www.postgresql.org/docs/13/glossary.html)
- Other PostgreSQL documentation and sources (Zenith fork sources can be found [here](https://github.com/zenithdb/postgres))
## Join the development
- Read `CONTRIBUTING.md` to learn about project code style and practices.
- Use glossary in [/docs/glossary.md](/docs/glossary.md)
- To get familiar with a source tree layout, use [/docs/sourcetree.md](/docs/sourcetree.md).
- To learn more about PostgreSQL internals, check http://www.interdb.jp/pg/index.html


@@ -16,8 +16,9 @@ toml = "0.5"
lazy_static = "1.4"
regex = "1"
anyhow = "1.0"
thiserror = "1"
bytes = "1.0.1"
nix = "0.20"
nix = "0.23"
url = "2.2.2"
hex = { version = "0.4.3", features = ["serde"] }
reqwest = { version = "0.11", features = ["blocking", "json"] }


@@ -0,0 +1,20 @@
# Page server and three safekeepers.
[pageserver]
pg_port = 64000
http_port = 9898
auth_type = 'Trust'
[[safekeepers]]
name = 'sk1'
pg_port = 5454
http_port = 7676
[[safekeepers]]
name = 'sk2'
pg_port = 5455
http_port = 7677
[[safekeepers]]
name = 'sk3'
pg_port = 5456
http_port = 7678

control_plane/simple.conf (new file, 11 lines)

@@ -0,0 +1,11 @@
# Minimal zenith environment with one safekeeper. This is equivalent to the built-in
# defaults that you get with no --config
[pageserver]
pg_port = 64000
http_port = 9898
auth_type = 'Trust'
[[safekeepers]]
name = 'single'
pg_port = 5454
http_port = 7676


@@ -1,18 +1,16 @@
use std::fs::{self, File, OpenOptions};
use std::collections::BTreeMap;
use std::fs::{self, File};
use std::io::Write;
use std::net::SocketAddr;
use std::net::TcpStream;
use std::os::unix::fs::PermissionsExt;
use std::process::Command;
use std::path::PathBuf;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use std::{collections::BTreeMap, path::PathBuf};
use anyhow::{Context, Result};
use lazy_static::lazy_static;
use postgres_ffi::pg_constants;
use regex::Regex;
use zenith_utils::connstring::connection_host_port;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
@@ -20,6 +18,7 @@ use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
use crate::storage::PageServerNode;
//
@@ -40,8 +39,6 @@ impl ComputeControlPlane {
// | |- <tenant_id>
// | | |- <branch name>
pub fn load(env: LocalEnv) -> Result<ComputeControlPlane> {
// TODO: since pageserver do not have config file yet we believe here that
// it is running on default port. Change that when pageserver will have config.
let pageserver = Arc::new(PageServerNode::from_env(&env));
let mut nodes = BTreeMap::default();
@@ -76,38 +73,59 @@ impl ComputeControlPlane {
.unwrap_or(self.base_port)
}
pub fn local(local_env: &LocalEnv, pageserver: &Arc<PageServerNode>) -> ComputeControlPlane {
ComputeControlPlane {
base_port: 65431,
pageserver: Arc::clone(pageserver),
nodes: BTreeMap::new(),
env: local_env.clone(),
// FIXME: see also parse_point_in_time in branches.rs.
fn parse_point_in_time(
&self,
tenantid: ZTenantId,
s: &str,
) -> Result<(ZTimelineId, Option<Lsn>)> {
let mut strings = s.split('@');
let name = strings.next().unwrap();
let lsn: Option<Lsn>;
if let Some(lsnstr) = strings.next() {
lsn = Some(
Lsn::from_str(lsnstr)
.with_context(|| "invalid LSN in point-in-time specification")?,
);
} else {
lsn = None
}
// Resolve the timeline ID, given the human-readable branch name
let timeline_id = self
.pageserver
.branch_get_by_name(&tenantid, name)?
.timeline_id;
Ok((timeline_id, lsn))
}
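// Editor's note (illustrative only, not part of this diff): the timeline spec
// accepted above is either a bare branch name or "<branch>@<lsn>". Assuming
// Lsn::from_str parses the usual "hi/lo" hex form printed elsewhere in this
// repo, the mapping looks like:
//
//   parse_point_in_time(tenantid, "main")           -> (timeline id of 'main', None)
//   parse_point_in_time(tenantid, "main@0/169C3C8") -> (timeline id of 'main', Some(0/169C3C8))
//
// In new_node() below, a Some(lsn) result makes the compute node a read-only
// replica: setup_pg_conf() emits it as recovery_target_lsn and start() drops a
// standby.signal file.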
pub fn new_node(
&mut self,
tenantid: ZTenantId,
branch_name: &str,
name: &str,
timeline_spec: &str,
port: Option<u16>,
) -> Result<Arc<PostgresNode>> {
let timeline_id = self
.pageserver
.branch_get_by_name(&tenantid, branch_name)?
.timeline_id;
// Resolve the human-readable timeline spec into timeline ID and LSN
let (timelineid, lsn) = self.parse_point_in_time(tenantid, timeline_spec)?;
let port = port.unwrap_or_else(|| self.get_port());
let node = Arc::new(PostgresNode {
name: branch_name.to_owned(),
address: SocketAddr::new("127.0.0.1".parse().unwrap(), self.get_port()),
name: name.to_owned(),
address: SocketAddr::new("127.0.0.1".parse().unwrap(), port),
env: self.env.clone(),
pageserver: Arc::clone(&self.pageserver),
is_test: false,
timelineid: timeline_id,
timelineid,
lsn,
tenantid,
uses_wal_proposer: false,
});
node.create_pgdata()?;
node.setup_pg_conf(self.env.auth_type)?;
node.setup_pg_conf(self.env.pageserver.auth_type)?;
self.nodes
.insert((tenantid, node.name.clone()), Arc::clone(&node));
@@ -126,6 +144,7 @@ pub struct PostgresNode {
pageserver: Arc<PageServerNode>,
is_test: bool,
pub timelineid: ZTimelineId,
pub lsn: Option<Lsn>, // if it's a read-only node. None for primary
pub tenantid: ZTenantId,
uses_wal_proposer: bool,
}
@@ -143,76 +162,28 @@ impl PostgresNode {
);
}
lazy_static! {
static ref CONF_PORT_RE: Regex = Regex::new(r"(?m)^\s*port\s*=\s*(\d+)\s*$").unwrap();
static ref CONF_TIMELINE_RE: Regex =
Regex::new(r"(?m)^\s*zenith.zenith_timeline\s*=\s*'(\w+)'\s*$").unwrap();
static ref CONF_TENANT_RE: Regex =
Regex::new(r"(?m)^\s*zenith.zenith_tenant\s*=\s*'(\w+)'\s*$").unwrap();
}
// parse data directory name
let fname = entry.file_name();
let name = fname.to_str().unwrap().to_string();
// find out tcp port in config file
// Read config file into memory
let cfg_path = entry.path().join("postgresql.conf");
let config = fs::read_to_string(cfg_path.clone()).with_context(|| {
format!(
"failed to read config file in {}",
cfg_path.to_str().unwrap()
)
})?;
let cfg_path_str = cfg_path.to_string_lossy();
let mut conf_file = File::open(&cfg_path)
.with_context(|| format!("failed to open config file in {}", cfg_path_str))?;
let conf = PostgresConf::read(&mut conf_file)
.with_context(|| format!("failed to read config file in {}", cfg_path_str))?;
// parse port
let err_msg = format!(
"failed to find port definition in config file {}",
cfg_path.to_str().unwrap()
);
let port: u16 = CONF_PORT_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
// Read a few options from the config file
let context = format!("in config file {}", cfg_path_str);
let port: u16 = conf.parse_field("port", &context)?;
let timelineid: ZTimelineId = conf.parse_field("zenith.zenith_timeline", &context)?;
let tenantid: ZTenantId = conf.parse_field("zenith.zenith_tenant", &context)?;
let uses_wal_proposer = conf.get("wal_acceptors").is_some();
// parse timeline
let err_msg = format!(
"failed to find timeline definition in config file {}",
cfg_path.to_str().unwrap()
);
let timelineid: ZTimelineId = CONF_TIMELINE_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
// parse tenant
let err_msg = format!(
"failed to find tenant definition in config file {}",
cfg_path.to_str().unwrap()
);
let tenantid = CONF_TENANT_RE
.captures(config.as_str())
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 1"))?
.iter()
.last()
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 2"))?
.ok_or_else(|| anyhow::Error::msg(err_msg.clone() + " 3"))?
.as_str()
.parse()
.with_context(|| err_msg)?;
let uses_wal_proposer = config.contains("wal_acceptors");
// parse recovery_target_lsn, if any
let recovery_target_lsn: Option<Lsn> =
conf.parse_field_optional("recovery_target_lsn", &context)?;
// ok now
Ok(PostgresNode {
@@ -222,31 +193,38 @@ impl PostgresNode {
pageserver: Arc::clone(pageserver),
is_test: false,
timelineid,
lsn: recovery_target_lsn,
tenantid,
uses_wal_proposer,
})
}
fn sync_walkeepers(&self) -> Result<Lsn> {
fn sync_safekeepers(&self) -> Result<Lsn> {
let pg_path = self.env.pg_bin_dir().join("postgres");
let sync_output = Command::new(pg_path)
let sync_handle = Command::new(pg_path)
.arg("--sync-safekeepers")
.env_clear()
.env("LD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", self.env.pg_lib_dir().to_str().unwrap())
.env("PGDATA", self.pgdata().to_str().unwrap())
.output()
.with_context(|| "sync-walkeepers failed")?;
.stdout(Stdio::piped())
// Comment this to avoid capturing stderr (useful if command hangs)
.stderr(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"sync-walkeepers failed: '{}'",
"sync-safekeepers failed: '{}'",
String::from_utf8_lossy(&sync_output.stderr)
);
}
let lsn = Lsn::from_str(std::str::from_utf8(&sync_output.stdout)?.trim())?;
println!("Walkeepers synced on {}", lsn);
println!("Safekeepers synced on {}", lsn);
Ok(lsn)
}
@@ -277,7 +255,7 @@ impl PostgresNode {
// Read the archive directly from the `CopyOutReader`
tar::Archive::new(copyreader)
.unpack(&self.pgdata())
.with_context(|| "extracting page backup failed")?;
.with_context(|| "extracting base backup failed")?;
Ok(())
}
@@ -301,84 +279,96 @@ impl PostgresNode {
// Connect to a page server, get base backup, and untar it to initialize a
// new data directory
fn setup_pg_conf(&self, auth_type: AuthType) -> Result<()> {
File::create(self.pgdata().join("postgresql.conf").to_str().unwrap())?;
let mut conf = PostgresConf::new();
conf.append("max_wal_senders", "10");
// wal_log_hints is mandatory when running against pageserver (see gh issue#192)
// TODO: is it possible to check wal_log_hints at pageserver side via XLOG_PARAMETER_CHANGE?
self.append_conf(
"postgresql.conf",
&format!(
"max_wal_senders = 10\n\
wal_log_hints = on\n\
max_replication_slots = 10\n\
hot_standby = on\n\
shared_buffers = 1MB\n\
fsync = off\n\
max_connections = 100\n\
wal_sender_timeout = 0\n\
wal_level = replica\n\
listen_addresses = '{address}'\n\
port = {port}\n",
address = self.address.ip(),
port = self.address.port()
),
)?;
conf.append("wal_log_hints", "on");
conf.append("max_replication_slots", "10");
conf.append("hot_standby", "on");
conf.append("shared_buffers", "1MB");
conf.append("max_wal_size", "100GB");
conf.append("fsync", "off");
conf.append("max_connections", "100");
conf.append("wal_level", "replica");
// wal_sender_timeout is the maximum time to wait for WAL replication.
// It also defines how often the walreceiver will send a feedback message to the wal sender.
//conf.append("wal_sender_timeout", "5s");
//conf.append("max_replication_flush_lag", "160MB");
//conf.append("max_replication_apply_lag", "1500MB");
conf.append("listen_addresses", &self.address.ip().to_string());
conf.append("port", &self.address.port().to_string());
// Never clean up old WAL. TODO: We should use a replication
// slot or something proper, to prevent the compute node
// from removing WAL that hasn't been streamed to the safekeeper or
// page server yet. (gh issue #349)
self.append_conf("postgresql.conf", "wal_keep_size='10TB'\n")?;
conf.append("wal_keep_size", "10TB");
// set up authentication
let password = if let AuthType::ZenithJWT = auth_type {
"$ZENITH_AUTH_TOKEN"
} else {
""
// Configure the node to fetch pages from pageserver
let pageserver_connstr = {
let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
// Set up authentication
//
// $ZENITH_AUTH_TOKEN will be replaced with the value from the environment
// variable during compute pg startup. It is done this way because
// otherwise the user would be able to retrieve the value using the SHOW
// command or pg_settings
let password = if let AuthType::ZenithJWT = auth_type {
"$ZENITH_AUTH_TOKEN"
} else {
""
};
format!("host={} port={} password={}", host, port, password)
};
conf.append("shared_preload_libraries", "zenith");
conf.append_line("");
conf.append("zenith.page_server_connstring", &pageserver_connstr);
conf.append("zenith.zenith_tenant", &self.tenantid.to_string());
conf.append("zenith.zenith_timeline", &self.timelineid.to_string());
if let Some(lsn) = self.lsn {
conf.append("recovery_target_lsn", &lsn.to_string());
}
conf.append_line("");
// Configure that node to take pages from pageserver
let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
self.append_conf(
"postgresql.conf",
format!(
concat!(
"shared_preload_libraries = zenith\n",
// $ZENITH_AUTH_TOKEN will be replaced with value from environment variable during compute pg startup
// it is done this way because otherwise user will be able to retrieve the value using SHOW command or pg_settings
"zenith.page_server_connstring = 'host={} port={} password={}'\n",
"zenith.zenith_timeline='{}'\n",
"zenith.zenith_tenant='{}'\n",
),
host, port, password, self.timelineid, self.tenantid,
)
.as_str(),
)?;
if !self.env.safekeepers.is_empty() {
// Configure the node to connect to the safekeepers
conf.append("synchronous_standby_names", "walproposer");
// Configure the node to stream WAL directly to the pageserver
self.append_conf(
"postgresql.conf",
format!(
concat!(
"synchronous_standby_names = 'pageserver'\n", // TODO: add a new function arg?
"zenith.callmemaybe_connstring = '{}'\n", // FIXME escaping
),
self.connstr(),
)
.as_str(),
)?;
let wal_acceptors = self
.env
.safekeepers
.iter()
.map(|sk| format!("localhost:{}", sk.pg_port))
.collect::<Vec<String>>()
.join(",");
conf.append("wal_acceptors", &wal_acceptors);
} else {
// Configure the node to stream WAL directly to the pageserver
// This isn't really a supported configuration, but can be useful for
// testing.
conf.append("synchronous_standby_names", "pageserver");
conf.append("zenith.callmemaybe_connstring", &self.connstr());
}
let mut file = File::create(self.pgdata().join("postgresql.conf"))?;
file.write_all(conf.to_string().as_bytes())?;
Ok(())
}
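// Editor's note (illustrative only, not part of this diff): with safekeepers
// configured, the postgresql.conf written above ends up containing, roughly
// (exact quoting is up to PostgresConf::to_string, which is not shown here):
//
//   max_wal_senders = 10
//   wal_log_hints = on
//   shared_buffers = 1MB
//   fsync = off
//   port = <node port>
//   wal_keep_size = 10TB
//   shared_preload_libraries = zenith
//   zenith.page_server_connstring = host=127.0.0.1 port=64000 password=
//   zenith.zenith_tenant = <tenant id>
//   zenith.zenith_timeline = <timeline id>
//   synchronous_standby_names = walproposer
//   wal_acceptors = localhost:5454
//
// plus recovery_target_lsn = <lsn> when the node is a read-only replica.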
fn load_basebackup(&self) -> Result<()> {
let lsn = if self.uses_wal_proposer {
// LSN WAL_SEGMENT_SIZE means that it is bootstrap and we need to download just
let backup_lsn = if let Some(lsn) = self.lsn {
Some(lsn)
} else if self.uses_wal_proposer {
// LSN 0 means that it is bootstrap and we need to download just
// latest data from the pageserver. That is a bit clumsy, but the whole bootstrap
// procedure is evolving quite actively right now, so let's think about it again
// when things are more stable (TODO).
let lsn = self.sync_walkeepers()?;
if lsn == Lsn(pg_constants::WAL_SEGMENT_SIZE as u64) {
let lsn = self.sync_safekeepers()?;
if lsn == Lsn(0) {
None
} else {
Some(lsn)
@@ -387,7 +377,7 @@ impl PostgresNode {
None
};
self.do_basebackup(lsn)?;
self.do_basebackup(backup_lsn)?;
Ok(())
}
@@ -409,14 +399,6 @@ impl PostgresNode {
}
}
pub fn append_conf(&self, config: &str, opts: &str) -> Result<()> {
OpenOptions::new()
.append(true)
.open(self.pgdata().join(config).to_str().unwrap())?
.write_all(opts.as_bytes())?;
Ok(())
}
fn pg_ctl(&self, args: &[&str], auth_token: &Option<String>) -> Result<()> {
let pg_ctl_path = self.env.pg_bin_dir().join("pg_ctl");
let mut cmd = Command::new(pg_ctl_path);
@@ -472,6 +454,10 @@ impl PostgresNode {
// 3. Load basebackup
self.load_basebackup()?;
if self.lsn.is_some() {
File::create(self.pgdata().join("standby.signal"))?;
}
// 4. Finally start the compute node postgres
println!("Starting postgres node at '{}'", self.connstr());
self.pg_ctl(&["start"], auth_token)
@@ -482,13 +468,22 @@ impl PostgresNode {
}
pub fn stop(&self, destroy: bool) -> Result<()> {
self.pg_ctl(&["-m", "immediate", "stop"], &None)?;
// If we are going to destroy data directory,
// use immediate shutdown mode, otherwise,
// shutdown gracefully to leave the data directory sane.
//
// The compute node always starts from scratch, so stopping
// without destroy is only used for testing and debugging.
//
if destroy {
self.pg_ctl(&["-m", "immediate", "stop"], &None)?;
println!(
"Destroying postgres data directory '{}'",
self.pgdata().to_str().unwrap()
);
fs::remove_dir_all(&self.pgdata())?;
} else {
self.pg_ctl(&["stop"], &None)?;
}
Ok(())
}
@@ -509,9 +504,7 @@ impl PostgresNode {
.output()
.expect("failed to execute whoami");
if !output.status.success() {
panic!("whoami failed");
}
assert!(output.status.success(), "whoami failed");
String::from_utf8(output.stdout).unwrap().trim().to_string()
}


@@ -12,6 +12,8 @@ use std::path::Path;
pub mod compute;
pub mod local_env;
pub mod postgresql_conf;
pub mod safekeeper;
pub mod storage;
/// Read a PID file


@@ -4,54 +4,105 @@
// Now it also provides init method which acts like a stub for proper installation
// script which will use local paths.
//
use anyhow::{anyhow, Context, Result};
use hex;
use anyhow::{Context, Result};
use serde::{Deserialize, Serialize};
use std::env;
use std::fmt::Write;
use std::fs;
use std::path::PathBuf;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use std::{collections::BTreeMap, env};
use url::Url;
use zenith_utils::auth::{encode_from_key_path, Claims, Scope};
use zenith_utils::auth::{encode_from_key_file, Claims, Scope};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
pub type Remotes = BTreeMap<String, String>;
//
// This data structures represent deserialized zenith CLI config
// This data structure represents the zenith CLI config
//
// It is deserialized from the .zenith/config file, or the config file passed
// to 'zenith init --config=<path>' option. See control_plane/simple.conf for
// an example.
//
#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct LocalEnv {
// Pageserver connection strings
pub pageserver_connstring: String,
// Base directory for both pageserver and compute nodes
// Base directory for all the nodes (the pageserver, safekeepers and
// compute nodes).
//
// This is not stored in the config file. Rather, this is the path where the
// config file itself is. It is read from the ZENITH_REPO_DIR env variable or
// '.zenith' if not given.
#[serde(skip)]
pub base_data_dir: PathBuf,
// Path to postgres distribution. It's expected that "bin", "include",
// "lib", "share" from postgres distribution are there. If at some point
// in time we will be able to run against vanilla postgres we may split that
// to four separate paths and match OS-specific installation layout.
#[serde(default)]
pub pg_distrib_dir: PathBuf,
// Path to pageserver binary. Empty for remote pageserver.
pub zenith_distrib_dir: Option<PathBuf>,
// Path to pageserver binary.
#[serde(default)]
pub zenith_distrib_dir: PathBuf,
// keeping tenant id in config to reduce copy paste when running zenith locally with single tenant
#[serde(with = "hex")]
pub tenantid: ZTenantId,
// Default tenant ID to use with the 'zenith' command line utility, when
// --tenantid is not explicitly specified.
#[serde(with = "opt_tenantid_serde")]
#[serde(default)]
pub default_tenantid: Option<ZTenantId>,
// jwt auth token used for communication with pageserver
pub auth_token: String,
// used to issue tokens during e.g pg start
#[serde(default)]
pub private_key_path: PathBuf,
pub pageserver: PageServerConf,
#[serde(default)]
pub safekeepers: Vec<SafekeeperConf>,
}
#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(default)]
pub struct PageServerConf {
// Pageserver connection settings
pub pg_port: u16,
pub http_port: u16,
// used to determine which auth type is used
pub auth_type: AuthType,
// used to issue tokens during e.g pg start
pub private_key_path: PathBuf,
// jwt auth token used for communication with pageserver
pub auth_token: String,
}
pub remotes: Remotes,
impl Default for PageServerConf {
fn default() -> Self {
Self {
pg_port: 0,
http_port: 0,
auth_type: AuthType::Trust,
auth_token: "".to_string(),
}
}
}
#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(default)]
pub struct SafekeeperConf {
pub name: String,
pub pg_port: u16,
pub http_port: u16,
pub sync: bool,
}
impl Default for SafekeeperConf {
fn default() -> Self {
Self {
name: "".to_string(),
pg_port: 0,
http_port: 0,
sync: true,
}
}
}
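// Editor's note (illustrative sketch, not part of this diff): these structs are
// what the TOML config files shown earlier (e.g. control_plane/simple.conf)
// deserialize into; #[serde(default)] lets omitted fields fall back to the
// Default impls above. A minimal round trip, mirroring what create_config()
// does below:
fn _example_parse_config() -> anyhow::Result<()> {
    let toml_src = r#"
[pageserver]
pg_port = 64000
http_port = 9898
auth_type = 'Trust'

[[safekeepers]]
name = 'single'
pg_port = 5454
http_port = 7676
"#;
    let env: LocalEnv = toml::from_str(toml_src)?;
    assert_eq!(env.safekeepers.len(), 1);
    assert_eq!(env.pageserver.pg_port, 64000);
    assert!(env.default_tenantid.is_none()); // create_config() generates one later
    Ok(())
}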
impl LocalEnv {
@@ -64,11 +115,11 @@ impl LocalEnv {
}
pub fn pageserver_bin(&self) -> Result<PathBuf> {
Ok(self
.zenith_distrib_dir
.as_ref()
.ok_or_else(|| anyhow!("Can not manage remote pageserver"))?
.join("pageserver"))
Ok(self.zenith_distrib_dir.join("pageserver"))
}
pub fn safekeeper_bin(&self) -> Result<PathBuf> {
Ok(self.zenith_distrib_dir.join("safekeeper"))
}
pub fn pg_data_dirs_path(&self) -> PathBuf {
@@ -85,6 +136,187 @@ impl LocalEnv {
pub fn pageserver_data_dir(&self) -> PathBuf {
self.base_data_dir.clone()
}
pub fn safekeeper_data_dir(&self, node_name: &str) -> PathBuf {
self.base_data_dir.join("safekeepers").join(node_name)
}
/// Create a LocalEnv from a config file.
///
/// Unlike 'load_config', this function fills in any defaults that are missing
/// from the config file.
pub fn create_config(toml: &str) -> Result<LocalEnv> {
let mut env: LocalEnv = toml::from_str(toml)?;
// Find postgres binaries.
// Follow POSTGRES_DISTRIB_DIR if set, otherwise look in "tmp_install".
if env.pg_distrib_dir == Path::new("") {
if let Some(postgres_bin) = env::var_os("POSTGRES_DISTRIB_DIR") {
env.pg_distrib_dir = postgres_bin.into();
} else {
let cwd = env::current_dir()?;
env.pg_distrib_dir = cwd.join("tmp_install")
}
}
if !env.pg_distrib_dir.join("bin/postgres").exists() {
anyhow::bail!(
"Can't find postgres binary at {}",
env.pg_distrib_dir.display()
);
}
// Find zenith binaries.
if env.zenith_distrib_dir == Path::new("") {
env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
if !env.zenith_distrib_dir.join("pageserver").exists() {
anyhow::bail!("Can't find pageserver binary.");
}
if !env.zenith_distrib_dir.join("safekeeper").exists() {
anyhow::bail!("Can't find safekeeper binary.");
}
// If no initial tenant ID was given, generate it.
if env.default_tenantid.is_none() {
env.default_tenantid = Some(ZTenantId::generate());
}
env.base_data_dir = base_path();
Ok(env)
}
/// Locate and load config
pub fn load_config() -> Result<LocalEnv> {
let repopath = base_path();
if !repopath.exists() {
anyhow::bail!(
"Zenith config is not found in {}. You need to run 'zenith init' first",
repopath.to_str().unwrap()
);
}
// TODO: check that it looks like a zenith repository
// load and parse file
let config = fs::read_to_string(repopath.join("config"))?;
let mut env: LocalEnv = toml::from_str(config.as_str())?;
env.base_data_dir = repopath;
Ok(env)
}
// this function is used only for testing purposes in the CLI, e.g. to generate tokens during init
pub fn generate_auth_token(&self, claims: &Claims) -> Result<String> {
let private_key_path = if self.private_key_path.is_absolute() {
self.private_key_path.to_path_buf()
} else {
self.base_data_dir.join(&self.private_key_path)
};
let key_data = fs::read(private_key_path)?;
encode_from_key_file(claims, &key_data)
}
//
// Initialize a new Zenith repository
//
pub fn init(&mut self) -> Result<()> {
// check if config already exists
let base_path = &self.base_data_dir;
if base_path == Path::new("") {
anyhow::bail!("repository base path is missing");
}
if base_path.exists() {
anyhow::bail!(
"directory '{}' already exists. Perhaps already initialized?",
base_path.to_str().unwrap()
);
}
fs::create_dir(&base_path)?;
// generate keys for jwt
// openssl genrsa -out private_key.pem 2048
let private_key_path;
if self.private_key_path == PathBuf::new() {
private_key_path = base_path.join("auth_private_key.pem");
let keygen_output = Command::new("openssl")
.arg("genrsa")
.args(&["-out", private_key_path.to_str().unwrap()])
.arg("2048")
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
self.private_key_path = Path::new("auth_private_key.pem").to_path_buf();
let public_key_path = base_path.join("auth_public_key.pem");
// openssl rsa -in private_key.pem -pubout -outform PEM -out public_key.pem
let keygen_output = Command::new("openssl")
.arg("rsa")
.args(&["-in", private_key_path.to_str().unwrap()])
.arg("-pubout")
.args(&["-outform", "PEM"])
.args(&["-out", public_key_path.to_str().unwrap()])
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
}
self.pageserver.auth_token =
self.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
fs::create_dir_all(self.pg_data_dirs_path())?;
for safekeeper in self.safekeepers.iter() {
fs::create_dir_all(self.safekeeper_data_dir(&safekeeper.name))?;
}
let mut conf_content = String::new();
// Currently, the user first passes a config file with 'zenith init --config=<path>'
// We read that in, in `create_config`, and fill any missing defaults. Then it's saved
// to .zenith/config. TODO: We lose any formatting and comments along the way, which is
// a bit sad.
write!(
&mut conf_content,
r#"# This file describes a locale deployment of the page server
# and safekeeeper node. It is read by the 'zenith' command-line
# utility.
"#
)?;
// Convert the LocalEnv to a toml file.
//
// This could be as simple as this:
//
// conf_content += &toml::to_string_pretty(env)?;
//
// But it results in a "values must be emitted before tables" error. I'm not sure
// why; AFAICS the table, i.e. 'safekeepers: Vec<SafekeeperConf>', is last.
// Maybe Rust reorders the fields to avoid padding or something?
// In any case, converting to toml::Value first, and serializing that, works.
// See https://github.com/alexcrichton/toml-rs/issues/142
conf_content += &toml::to_string_pretty(&toml::Value::try_from(&self)?)?;
fs::write(base_path.join("config"), conf_content)?;
Ok(())
}
}
fn base_path() -> PathBuf {
@@ -94,143 +326,29 @@ fn base_path() -> PathBuf {
}
}
//
// Initialize a new Zenith repository
//
pub fn init(
remote_pageserver: Option<&str>,
tenantid: ZTenantId,
auth_type: AuthType,
) -> Result<()> {
// check if config already exists
let base_path = base_path();
if base_path.exists() {
anyhow::bail!(
"{} already exists. Perhaps already initialized?",
base_path.to_str().unwrap()
);
}
fs::create_dir(&base_path)?;
/// Serde routines for Option<ZTenantId>. The serialized form is a hex string.
mod opt_tenantid_serde {
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use std::str::FromStr;
use zenith_utils::zid::ZTenantId;
// ok, now check that expected binaries are present
// Find postgres binaries. Follow POSTGRES_DISTRIB_DIR if set, otherwise look in "tmp_install".
let pg_distrib_dir: PathBuf = {
if let Some(postgres_bin) = env::var_os("POSTGRES_DISTRIB_DIR") {
postgres_bin.into()
} else {
let cwd = env::current_dir()?;
cwd.join("tmp_install")
}
};
if !pg_distrib_dir.join("bin/postgres").exists() {
anyhow::bail!("Can't find postgres binary at {:?}", pg_distrib_dir);
pub fn serialize<S>(tenantid: &Option<ZTenantId>, ser: S) -> Result<S::Ok, S::Error>
where
S: Serializer,
{
tenantid.map(|t| t.to_string()).serialize(ser)
}
// generate keys for jwt
// openssl genrsa -out private_key.pem 2048
let private_key_path = base_path.join("auth_private_key.pem");
let keygen_output = Command::new("openssl")
.arg("genrsa")
.args(&["-out", private_key_path.to_str().unwrap()])
.arg("2048")
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
pub fn deserialize<'de, D>(des: D) -> Result<Option<ZTenantId>, D::Error>
where
D: Deserializer<'de>,
{
let s: Option<String> = Option::deserialize(des)?;
if let Some(s) = s {
return Ok(Some(
ZTenantId::from_str(&s).map_err(serde::de::Error::custom)?,
));
}
Ok(None)
}
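// Usage sketch (the struct and field names below are illustrative, not taken
// from the code above):
//
//     #[derive(serde::Serialize, serde::Deserialize)]
//     struct ExampleConf {
//         #[serde(with = "opt_tenantid_serde", default)]
//         tenantid: Option<ZTenantId>,
//     }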
let public_key_path = base_path.join("auth_public_key.pem");
// openssl rsa -in private_key.pem -pubout -outform PEM -out public_key.pem
let keygen_output = Command::new("openssl")
.arg("rsa")
.args(&["-in", private_key_path.to_str().unwrap()])
.arg("-pubout")
.args(&["-outform", "PEM"])
.args(&["-out", public_key_path.to_str().unwrap()])
.stdout(Stdio::null())
.output()
.with_context(|| "failed to generate auth private key")?;
if !keygen_output.status.success() {
anyhow::bail!(
"openssl failed: '{}'",
String::from_utf8_lossy(&keygen_output.stderr)
);
}
let auth_token =
encode_from_key_path(&Claims::new(None, Scope::PageServerApi), &private_key_path)?;
let conf = if let Some(addr) = remote_pageserver {
// check that addr is parsable
let _uri = Url::parse(addr).map_err(|e| anyhow!("{}: {}", addr, e))?;
LocalEnv {
pageserver_connstring: format!("postgresql://{}/", addr),
pg_distrib_dir,
zenith_distrib_dir: None,
base_data_dir: base_path,
remotes: BTreeMap::default(),
tenantid,
auth_token,
auth_type,
private_key_path,
}
} else {
// Find zenith binaries.
let zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
if !zenith_distrib_dir.join("pageserver").exists() {
anyhow::bail!("Can't find pageserver binary.",);
}
LocalEnv {
pageserver_connstring: "postgresql://127.0.0.1:6400".to_string(),
pg_distrib_dir,
zenith_distrib_dir: Some(zenith_distrib_dir),
base_data_dir: base_path,
remotes: BTreeMap::default(),
tenantid,
auth_token,
auth_type,
private_key_path,
}
};
fs::create_dir_all(conf.pg_data_dirs_path())?;
let toml = toml::to_string_pretty(&conf)?;
fs::write(conf.base_data_dir.join("config"), toml)?;
Ok(())
}
// Locate and load config
pub fn load_config() -> Result<LocalEnv> {
let repopath = base_path();
if !repopath.exists() {
anyhow::bail!(
"Zenith config is not found in {}. You need to run 'zenith init' first",
repopath.to_str().unwrap()
);
}
// TODO: check that it looks like a zenith repository
// load and parse file
let config = fs::read_to_string(repopath.join("config"))?;
toml::from_str(config.as_str()).map_err(|e| e.into())
}
// Save config. We use that to change set of remotes from CLI itself.
pub fn save_config(conf: &LocalEnv) -> Result<()> {
let config_path = base_path().join("config");
let conf_str = toml::to_string_pretty(conf)?;
fs::write(config_path, conf_str)?;
Ok(())
}

View File

@@ -0,0 +1,228 @@
///
/// Module for parsing postgresql.conf file.
///
/// NOTE: This doesn't implement the full, correct postgresql.conf syntax. Just
/// enough to extract a few settings we need in Zenith, assuming you don't do
/// funny stuff like include-directives or funny escaping.
use anyhow::{anyhow, bail, Context, Result};
use lazy_static::lazy_static;
use regex::Regex;
use std::collections::HashMap;
use std::fmt;
use std::io::BufRead;
use std::str::FromStr;
/// In-memory representation of a postgresql.conf file
#[derive(Default)]
pub struct PostgresConf {
lines: Vec<String>,
hash: HashMap<String, String>,
}
lazy_static! {
static ref CONF_LINE_RE: Regex = Regex::new(r"^((?:\w|\.)+)\s*=\s*(\S+)$").unwrap();
}
impl PostgresConf {
pub fn new() -> PostgresConf {
PostgresConf::default()
}
/// Read file into memory
pub fn read(read: impl std::io::Read) -> Result<PostgresConf> {
let mut result = Self::new();
for line in std::io::BufReader::new(read).lines() {
let line = line?;
// Store each line in a vector, in original format
result.lines.push(line.clone());
// Also parse each line and insert key=value lines into a hash map.
//
// FIXME: This doesn't match exactly the flex/bison grammar in PostgreSQL.
// But it's close enough for our usage.
let line = line.trim();
if line.starts_with('#') {
// comment, ignore
continue;
} else if let Some(caps) = CONF_LINE_RE.captures(line) {
let name = caps.get(1).unwrap().as_str();
let raw_val = caps.get(2).unwrap().as_str();
if let Ok(val) = deescape_str(raw_val) {
// Note: if there's already an entry in the hash map for
// this key, this will replace it. That's the behavior
// we want; when PostgreSQL reads the file, each line
// overrides any previous value for the same setting.
result.hash.insert(name.to_string(), val.to_string());
}
}
}
Ok(result)
}
/// Return the current value of 'option'
pub fn get(&self, option: &str) -> Option<&str> {
self.hash.get(option).map(|x| x.as_ref())
}
/// Return the current value of a field, parsed to the right datatype.
///
/// This calls the FromStr::parse() function on the value of the field. If
/// the field does not exist, or parsing fails, returns an error.
///
pub fn parse_field<T>(&self, field_name: &str, context: &str) -> Result<T>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
self.get(field_name)
.ok_or_else(|| anyhow!("could not find '{}' option {}", field_name, context))?
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))
}
pub fn parse_field_optional<T>(&self, field_name: &str, context: &str) -> Result<Option<T>>
where
T: FromStr,
<T as FromStr>::Err: std::error::Error + Send + Sync + 'static,
{
if let Some(val) = self.get(field_name) {
let result = val
.parse::<T>()
.with_context(|| format!("could not parse '{}' option {}", field_name, context))?;
Ok(Some(result))
} else {
Ok(None)
}
}
///
/// Note: if you call this multiple times for the same option, the config
/// file will have a line for each call. It would be nice to have a function
/// to change an existing line, but that's a TODO.
///
pub fn append(&mut self, option: &str, value: &str) {
self.lines
.push(format!("{}={}\n", option, escape_str(value)));
self.hash.insert(option.to_string(), value.to_string());
}
/// Append an arbitrary non-setting line to the config file
pub fn append_line(&mut self, line: &str) {
self.lines.push(line.to_string());
}
}
impl fmt::Display for PostgresConf {
/// Return the whole configuration file as a string
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
for line in self.lines.iter() {
f.write_str(line)?;
}
Ok(())
}
}
/// Escape a value for putting in postgresql.conf.
fn escape_str(s: &str) -> String {
// If the string doesn't contain anything that needs quoting or escaping, return it
// as it is.
//
// The first part of the regex, before the '|', matches the INTEGER rule in the
// PostgreSQL flex grammar (guc-file.l). It matches plain integers like "123" and
// "-123", and also accepts units like "10MB". The second part of the regex matches
// the UNQUOTED_STRING rule, and accepts strings that contain a single word, beginning
// with a letter. That covers words like "off" or "posix". Everything else is quoted.
//
// This regex is a bit more conservative than the rules in guc-file.l, so we quote some
// strings that PostgreSQL would accept without quoting, but that's OK.
lazy_static! {
static ref UNQUOTED_RE: Regex =
Regex::new(r"(^[-+]?[0-9]+[a-zA-Z]*$)|(^[a-zA-Z][a-zA-Z0-9]*$)").unwrap();
}
if UNQUOTED_RE.is_match(s) {
s.to_string()
} else {
// Otherwise escape and quote it
let s = s
.replace('\\', "\\\\")
.replace('\n', "\\n")
.replace('\'', "''");
"\'".to_owned() + &s + "\'"
}
}
/// De-escape a possibly-quoted value.
///
/// See `DeescapeQuotedString` function in PostgreSQL sources for how PostgreSQL
/// does this.
fn deescape_str(s: &str) -> Result<String> {
// If the string has a quote at the beginning and end, strip them out.
if s.len() >= 2 && s.starts_with('\'') && s.ends_with('\'') {
let mut result = String::new();
let mut iter = s[1..(s.len() - 1)].chars().peekable();
while let Some(c) = iter.next() {
let newc = if c == '\\' {
match iter.next() {
Some('b') => '\x08',
Some('f') => '\x0c',
Some('n') => '\n',
Some('r') => '\r',
Some('t') => '\t',
Some('0'..='7') => {
// TODO
bail!("octal escapes not supported");
}
Some(n) => n,
None => break,
}
} else if c == '\'' && iter.peek() == Some(&'\'') {
// doubled quote becomes just one quote
iter.next().unwrap()
} else {
c
};
result.push(newc);
}
Ok(result)
} else {
Ok(s.to_string())
}
}
#[test]
fn test_postgresql_conf_escapes() -> Result<()> {
assert_eq!(escape_str("foo bar"), "'foo bar'");
// these don't need to be quoted
assert_eq!(escape_str("foo"), "foo");
assert_eq!(escape_str("123"), "123");
assert_eq!(escape_str("+123"), "+123");
assert_eq!(escape_str("-10"), "-10");
assert_eq!(escape_str("1foo"), "1foo");
assert_eq!(escape_str("foo1"), "foo1");
assert_eq!(escape_str("10MB"), "10MB");
assert_eq!(escape_str("-10kB"), "-10kB");
// these need quoting and/or escaping
assert_eq!(escape_str("foo bar"), "'foo bar'");
assert_eq!(escape_str("fo'o"), "'fo''o'");
assert_eq!(escape_str("fo\no"), "'fo\\no'");
assert_eq!(escape_str("fo\\o"), "'fo\\\\o'");
assert_eq!(escape_str("10 cats"), "'10 cats'");
// Test de-escaping
assert_eq!(deescape_str(&escape_str("foo"))?, "foo");
assert_eq!(deescape_str(&escape_str("fo'o\nba\\r"))?, "fo'o\nba\\r");
assert_eq!(deescape_str("'\\b\\f\\n\\r\\t'")?, "\x08\x0c\n\r\t");
// octal-escapes are currently not supported
assert!(deescape_str("'foo\\7\\07\\007'").is_err());
Ok(())
}
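// A minimal usage sketch of the API above. The input text is illustrative;
// everything else follows the definitions in this module.
#[test]
fn test_postgresql_conf_usage_sketch() -> Result<()> {
    let input = "shared_buffers=128MB\nport = 5432\n# just a comment\n";
    let conf = PostgresConf::read(input.as_bytes())?;
    // Key/value lines end up in the hash map, accessible as strings...
    assert_eq!(conf.get("shared_buffers"), Some("128MB"));
    // ...or parsed into the desired type.
    let port: u16 = conf.parse_field("port", "in the usage sketch")?;
    assert_eq!(port, 5432);
    Ok(())
}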

View File

@@ -0,0 +1,281 @@
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
use std::process::Command;
use std::sync::Arc;
use std::time::Duration;
use std::{io, result, thread};
use anyhow::bail;
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use postgres::Config;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::postgres_backend::AuthType;
use crate::local_env::{LocalEnv, SafekeeperConf};
use crate::read_pidfile;
use crate::storage::PageServerNode;
use zenith_utils::connstring::connection_address;
use zenith_utils::connstring::connection_host_port;
#[derive(Error, Debug)]
pub enum SafekeeperHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
type Result<T> = result::Result<T, SafekeeperHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
//
// Control routines for safekeeper.
//
// Used in CLI and tests.
//
#[derive(Debug)]
pub struct SafekeeperNode {
pub name: String,
pub conf: SafekeeperConf,
pub pg_connection_config: Config,
pub env: LocalEnv,
pub http_client: Client,
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
}
impl SafekeeperNode {
pub fn from_env(env: &LocalEnv, conf: &SafekeeperConf) -> SafekeeperNode {
let pageserver = Arc::new(PageServerNode::from_env(env));
println!("initializing for {} for {}", conf.name, conf.http_port);
SafekeeperNode {
name: conf.name.clone(),
conf: conf.clone(),
pg_connection_config: Self::safekeeper_connection_config(conf.pg_port),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://localhost:{}/v1", conf.http_port),
pageserver,
}
}
/// Construct libpq connection string for connecting to this safekeeper.
fn safekeeper_connection_config(port: u16) -> Config {
// TODO safekeeper authentication not implemented yet
format!("postgresql://no_user@localhost:{}/no_db", port)
.parse()
.unwrap()
}
pub fn datadir_path(&self) -> PathBuf {
self.env.safekeeper_data_dir(&self.name)
}
pub fn pid_file(&self) -> PathBuf {
self.datadir_path().join("safekeeper.pid")
}
pub fn start(&self) -> anyhow::Result<()> {
print!(
"Starting safekeeper at '{}' in '{}'",
connection_address(&self.pg_connection_config),
self.datadir_path().display()
);
io::stdout().flush().unwrap();
// Configure connection to page server
//
// FIXME: We extract the host and port from the connection string instead of using
// the connection string directly, because the 'safekeeper' binary expects
// host:port format. That's a bit silly when we already have a full libpq connection
// string at hand.
let pageserver_conn = {
let (host, port) = connection_host_port(&self.pageserver.pg_connection_config);
format!("{}:{}", host, port)
};
let listen_pg = format!("localhost:{}", self.conf.pg_port);
let listen_http = format!("localhost:{}", self.conf.http_port);
let mut cmd = Command::new(self.env.safekeeper_bin()?);
cmd.args(&["-D", self.datadir_path().to_str().unwrap()])
.args(&["--listen-pg", &listen_pg])
.args(&["--listen-http", &listen_http])
.args(&["--pageserver", &pageserver_conn])
.args(&["--recall", "1 second"])
.arg("--daemonize")
.env_clear()
.env("RUST_BACKTRACE", "1");
if !self.conf.sync {
cmd.arg("--no-sync");
}
if self.env.pageserver.auth_type == AuthType::ZenithJWT {
cmd.env("PAGESERVER_AUTH_TOKEN", &self.env.pageserver.auth_token);
}
let var = "LLVM_PROFILE_FILE";
if let Some(val) = std::env::var_os(var) {
cmd.env(var, val);
}
if !cmd.status()?.success() {
bail!(
"Safekeeper failed to start. See '{}' for details.",
self.datadir_path().join("safekeeper.log").display()
);
}
// It takes a while for the safekeeper to start up. Wait until it is
// open for business.
const RETRIES: i8 = 15;
for retries in 1..RETRIES {
match self.check_status() {
Ok(_) => {
println!("\nSafekeeper started");
return Ok(());
}
Err(err) => {
match err {
SafekeeperHttpError::Transport(err) => {
if err.is_connect() && retries < 5 {
print!(".");
io::stdout().flush().unwrap();
} else {
if retries == 5 {
println!() // put a line break after dots for second message
}
println!(
"Safekeeper not responding yet, err {} retrying ({})...",
err, retries
);
}
}
SafekeeperHttpError::Response(msg) => {
bail!("safekeeper failed to start: {} ", msg)
}
}
thread::sleep(Duration::from_secs(1));
}
}
}
bail!("safekeeper failed to start in {} seconds", RETRIES);
}
///
/// Stop the server.
///
/// If 'immediate' is true, we use SIGQUIT, killing the process immediately.
/// Otherwise we use SIGTERM, triggering a clean shutdown
///
/// If the server is not running, returns success
///
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
let pid_file = self.pid_file();
if !pid_file.exists() {
println!("Safekeeper {} is already stopped", self.name);
return Ok(());
}
let pid = read_pidfile(&pid_file)?;
let pid = Pid::from_raw(pid);
let sig = if immediate {
println!("Stop safekeeper immediately");
Signal::SIGQUIT
} else {
println!("Stop safekeeper gracefully");
Signal::SIGTERM
};
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
println!(
"Safekeeper with pid {} does not exist, but a PID file was found",
pid
);
return Ok(());
}
Err(err) => bail!(
"Failed to send signal to safekeeper with pid {}: {}",
pid,
err.desc()
),
}
let address = connection_address(&self.pg_connection_config);
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if safekeeper flushes a lot of data
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Safekeeper stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Safekeeper status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Safekeeper status is: {}", err);
return Ok(());
}
}
} else {
println!("Safekeeper still receives connections");
thread::sleep(Duration::from_secs(1));
}
}
bail!("Failed to stop safekeeper with pid {}", pid);
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
// TODO: authentication
//if self.env.auth_type == AuthType::ZenithJWT {
// builder = builder.bearer_auth(&self.env.safekeeper_auth_token)
//}
self.http_client.request(method, url)
}
pub fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()?
.error_from_body()?;
Ok(())
}
}
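// Usage sketch (assumed call sites; the load_config path is illustrative):
//
//     let env = local_env::load_config()?;
//     for conf in &env.safekeepers {
//         let node = SafekeeperNode::from_env(&env, conf);
//         node.start()?;          // daemonizes the safekeeper binary
//         node.check_status()?;   // GET /v1/status on the HTTP API
//     }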

View File

@@ -1,26 +1,61 @@
use std::collections::HashMap;
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
use std::process::Command;
use std::thread;
use std::time::Duration;
use std::{io, result, thread};
use anyhow::{anyhow, bail, ensure, Result};
use anyhow::bail;
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use pageserver::http::models::{BranchCreateRequest, TenantCreateRequest};
use postgres::{Config, NoTls};
use reqwest::blocking::{Client, RequestBuilder};
use reqwest::{IntoUrl, Method, StatusCode};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
use crate::local_env::LocalEnv;
use crate::read_pidfile;
use pageserver::branches::BranchInfo;
use pageserver::tenant_mgr::TenantInfo;
use zenith_utils::connstring::connection_address;
const HTTP_BASE_URL: &str = "http://127.0.0.1:9898/v1";
#[derive(Error, Debug)]
pub enum PageserverHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
type Result<T> = result::Result<T, PageserverHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(PageserverHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
//
// Control routines for pageserver.
@@ -29,7 +64,6 @@ const HTTP_BASE_URL: &str = "http://127.0.0.1:9898/v1";
//
#[derive(Debug)]
pub struct PageServerNode {
pub kill_on_exit: bool,
pub pg_connection_config: Config,
pub env: LocalEnv,
pub http_client: Client,
@@ -38,58 +72,68 @@ pub struct PageServerNode {
impl PageServerNode {
pub fn from_env(env: &LocalEnv) -> PageServerNode {
let password = if env.auth_type == AuthType::ZenithJWT {
&env.auth_token
let password = if env.pageserver.auth_type == AuthType::ZenithJWT {
&env.pageserver.auth_token
} else {
""
};
PageServerNode {
kill_on_exit: false,
pg_connection_config: Self::default_config(password), // default
pg_connection_config: Self::pageserver_connection_config(
password,
env.pageserver.pg_port,
),
env: env.clone(),
http_client: Client::new(),
http_base_url: HTTP_BASE_URL.to_owned(),
http_base_url: format!("http://localhost:{}/v1", env.pageserver.http_port),
}
}
fn default_config(password: &str) -> Config {
format!("postgresql://no_user:{}@localhost:64000/no_db", password)
/// Construct libpq connection string for connecting to the pageserver.
fn pageserver_connection_config(password: &str, port: u16) -> Config {
format!("postgresql://no_user:{}@localhost:{}/no_db", password, port)
.parse()
.unwrap()
}
pub fn init(&self, create_tenant: Option<&str>, enable_auth: bool) -> Result<()> {
let mut cmd = Command::new(self.env.pageserver_bin()?);
pub fn init(&self, create_tenant: Option<&str>) -> anyhow::Result<()> {
let listen_pg = format!("localhost:{}", self.env.pageserver.pg_port);
let listen_http = format!("localhost:{}", self.env.pageserver.http_port);
let mut args = vec![
"--init",
"-D",
self.env.base_data_dir.to_str().unwrap(),
"--postgres-distrib",
self.env.pg_distrib_dir.to_str().unwrap(),
"--listen-pg",
&listen_pg,
"--listen-http",
&listen_http,
];
if enable_auth {
let auth_type_str = &self.env.pageserver.auth_type.to_string();
if self.env.pageserver.auth_type != AuthType::Trust {
args.extend(&["--auth-validation-public-key-path", "auth_public_key.pem"]);
args.extend(&["--auth-type", "ZenithJWT"]);
}
args.extend(&["--auth-type", auth_type_str]);
if let Some(tenantid) = create_tenant {
args.extend(&["--create-tenant", tenantid])
}
let status = cmd
.args(args)
.env_clear()
.env("RUST_BACKTRACE", "1")
.status()
.expect("pageserver init failed");
let mut cmd = Command::new(self.env.pageserver_bin()?);
cmd.args(args).env_clear().env("RUST_BACKTRACE", "1");
if status.success() {
Ok(())
} else {
Err(anyhow!("pageserver init failed"))
let var = "LLVM_PROFILE_FILE";
if let Some(val) = std::env::var_os(var) {
cmd.env(var, val);
}
if !cmd.status()?.success() {
bail!("pageserver init failed");
}
Ok(())
}
pub fn repo_path(&self) -> PathBuf {
@@ -100,19 +144,25 @@ impl PageServerNode {
self.repo_path().join("pageserver.pid")
}
pub fn start(&self) -> Result<()> {
println!(
"Starting pageserver at '{}' in {}",
pub fn start(&self) -> anyhow::Result<()> {
print!(
"Starting pageserver at '{}' in '{}'",
connection_address(&self.pg_connection_config),
self.repo_path().display()
);
io::stdout().flush().unwrap();
let mut cmd = Command::new(self.env.pageserver_bin()?);
cmd.args(&["-D", self.repo_path().to_str().unwrap()])
.arg("-d")
.arg("--daemonize")
.env_clear()
.env("RUST_BACKTRACE", "1");
let var = "LLVM_PROFILE_FILE";
if let Some(val) = std::env::var_os(var) {
cmd.env(var, val);
}
if !cmd.status()?.success() {
bail!(
"Pageserver failed to start. See '{}' for details.",
@@ -122,41 +172,103 @@ impl PageServerNode {
// It takes a while for the page server to start up. Wait until it is
// open for business.
for retries in 1..15 {
const RETRIES: i8 = 15;
for retries in 1..RETRIES {
match self.check_status() {
Ok(_) => {
println!("Pageserver started");
println!("\nPageserver started");
return Ok(());
}
Err(err) => {
println!(
"Pageserver not responding yet, err {} retrying ({})...",
err, retries
);
match err {
PageserverHttpError::Transport(err) => {
if err.is_connect() && retries < 5 {
print!(".");
io::stdout().flush().unwrap();
} else {
if retries == 5 {
println!() // put a line break after dots for second message
}
println!(
"Pageserver not responding yet, err {} retrying ({})...",
err, retries
);
}
}
PageserverHttpError::Response(msg) => {
bail!("pageserver failed to start: {} ", msg)
}
}
thread::sleep(Duration::from_secs(1));
}
}
}
bail!("pageserver failed to start");
bail!("pageserver failed to start in {} seconds", RETRIES);
}
pub fn stop(&self) -> Result<()> {
let pid = read_pidfile(&self.pid_file())?;
let pid = Pid::from_raw(pid);
if kill(pid, Signal::SIGTERM).is_err() {
bail!("Failed to kill pageserver with pid {}", pid);
///
/// Stop the server.
///
/// If 'immediate' is true, we use SIGQUIT, killing the process immediately.
/// Otherwise we use SIGTERM, triggering a clean shutdown
///
/// If the server is not running, returns success
///
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
let pid_file = self.pid_file();
if !pid_file.exists() {
println!("Pageserver is already stopped");
return Ok(());
}
let pid = Pid::from_raw(read_pidfile(&pid_file)?);
// wait for pageserver stop
let address = connection_address(&self.pg_connection_config);
for _ in 0..5 {
let stream = TcpStream::connect(&address);
thread::sleep(Duration::from_secs(1));
if let Err(_e) = stream {
println!("Pageserver stopped");
let sig = if immediate {
println!("Stop pageserver immediately");
Signal::SIGQUIT
} else {
println!("Stop pageserver gracefully");
Signal::SIGTERM
};
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
println!(
"Pageserver with pid {} does not exist, but a PID file was found",
pid
);
return Ok(());
}
println!("Stopping pageserver on {}", address);
Err(err) => bail!(
"Failed to send signal to pageserver with pid {}: {}",
pid,
err.desc()
),
}
let address = connection_address(&self.pg_connection_config);
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if pageserver checkpoints a lot of data
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Pageserver stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Pageserver status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Pageserver status is: {}", err);
return Ok(());
}
}
} else {
println!("Pageserver still receives connections");
thread::sleep(Duration::from_secs(1));
}
}
bail!("Failed to stop pageserver with pid {}", pid);
@@ -169,35 +281,30 @@ impl PageServerNode {
client.simple_query(sql).unwrap()
}
pub fn page_server_psql_client(&self) -> Result<postgres::Client, postgres::Error> {
pub fn page_server_psql_client(&self) -> result::Result<postgres::Client, postgres::Error> {
self.pg_connection_config.connect(NoTls)
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
let mut builder = self.http_client.request(method, url);
if self.env.auth_type == AuthType::ZenithJWT {
builder = builder.bearer_auth(&self.env.auth_token)
if self.env.pageserver.auth_type == AuthType::ZenithJWT {
builder = builder.bearer_auth(&self.env.pageserver.auth_token)
}
builder
}
pub fn check_status(&self) -> Result<()> {
let status = self
.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()?
.status();
ensure!(
status == StatusCode::OK,
format!("got unexpected response status {}", status)
);
.error_from_body()?;
Ok(())
}
pub fn tenant_list(&self) -> Result<Vec<String>> {
pub fn tenant_list(&self) -> Result<Vec<TenantInfo>> {
Ok(self
.http_request(Method::GET, format!("{}/{}", self.http_base_url, "tenant"))
.send()?
.error_for_status()?
.error_from_body()?
.json()?)
}
@@ -208,7 +315,7 @@ impl PageServerNode {
tenant_id: tenantid,
})
.send()?
.error_for_status()?
.error_from_body()?
.json()?)
}
@@ -219,7 +326,7 @@ impl PageServerNode {
format!("{}/branch/{}", self.http_base_url, tenantid),
)
.send()?
.error_for_status()?
.error_from_body()?
.json()?)
}
@@ -230,42 +337,29 @@ impl PageServerNode {
tenantid: &ZTenantId,
) -> Result<BranchInfo> {
Ok(self
.http_request(Method::POST, format!("{}/{}", self.http_base_url, "branch"))
.http_request(Method::POST, format!("{}/branch", self.http_base_url))
.json(&BranchCreateRequest {
tenant_id: tenantid.to_owned(),
name: branch_name.to_owned(),
start_point: startpoint.to_owned(),
})
.send()?
.error_for_status()?
.error_from_body()?
.json()?)
}
// TODO: make this a separate request type and avoid loading all the branches
pub fn branch_get_by_name(
&self,
tenantid: &ZTenantId,
branch_name: &str,
) -> Result<BranchInfo> {
let branch_infos = self.branch_list(tenantid)?;
let branch_by_name: Result<HashMap<String, BranchInfo>> = branch_infos
.into_iter()
.map(|branch_info| Ok((branch_info.name.clone(), branch_info)))
.collect();
let branch_by_name = branch_by_name?;
let branch = branch_by_name
.get(branch_name)
.ok_or_else(|| anyhow!("Branch {} not found", branch_name))?;
Ok(branch.clone())
}
}
impl Drop for PageServerNode {
fn drop(&mut self) {
if self.kill_on_exit {
let _ = self.stop();
}
Ok(self
.http_request(
Method::GET,
format!("{}/branch/{}/{}", self.http_base_url, tenantid, branch_name),
)
.send()?
.error_for_status()?
.json()?)
}
}

View File

@@ -7,7 +7,7 @@ if [ "$1" = 'pageserver' ]; then
pageserver --init -D /data --postgres-distrib /usr/local
fi
echo "Staring pageserver at 0.0.0.0:6400"
pageserver -l 0.0.0.0:6400 -D /data
pageserver -l 0.0.0.0:6400 --listen-http 0.0.0.0:9898 -D /data
else
"$@"
fi

View File

@@ -10,5 +10,5 @@
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [walkeeper/README](/walkeeper/README.md) — WAL service overview.
- [walkeeper/README](/walkeeper/README) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core

View File

@@ -4,7 +4,7 @@
Currently we build two main images:
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `wal_acceptor` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
And two intermediate images used either to reduce build time or to deliver some additional binary tools from other repos:

View File

@@ -30,12 +30,20 @@ writes out the changes from in-memory layers into new layer files. This process
is called "checkpointing". The page server only creates layer files for
relations that have been modified since the last checkpoint.
Configuration parameter `checkpoint_distance` defines the distance
from the current LSN at which in-memory layers are checkpointed.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
Set this parameter to `0` to force a checkpoint of every layer.
Configuration parameter `checkpoint_period` defines the interval between checkpoint iterations.
Default is `DEFAULT_CHECKPOINT_PERIOD`.
### Compute node
Stateless Postgres node that stores its data in the pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
@@ -43,11 +51,14 @@ Each PostgreSQL fork is considered a separate relish.
### Layer
Each layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs.
A layer contains data needed to reconstruct any page versions within the
layer's Segment and range of LSNs.
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable.
are immutable. See pageserver/src/layered_repository/README.md for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
@@ -122,6 +133,20 @@ One repository corresponds to one Tenant.
How much history do we need to keep around for PITR and read-only nodes?
### Segment (PostgreSQL)
NOTE: This is an overloaded term.
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### Segment (Layered Repository)
NOTE: This is an overloaded term.
Segment is a RELISH_SEG_SIZE slice of relish (identified by a SegmentTag).
### SLRU
SLRUs include pg_clog, pg_multixact/members, and

View File

@@ -56,4 +56,4 @@ Tenant id is passed to postgres via GUC the same way as the timeline. Tenant id
### Safety
For now particular tenant can only appear on a particular pageserver. Set of WAL acceptors are also pinned to particular (tenantid, timeline) pair so there can only be one writer for particular (tenantid, timeline).
For now, a particular tenant can only appear on a particular pageserver. The set of safekeepers is also pinned to a particular (tenantid, timeline) pair, so there can only be one writer for a particular (tenantid, timeline).

128
docs/settings.md Normal file
View File

@@ -0,0 +1,128 @@
## Pageserver
### listen_pg_addr
Network interface and port number to listen at for connections from
the compute nodes and safekeepers. The default is `127.0.0.1:64000`.
### listen_http_addr
Network interface and port number to listen at for admin connections.
The default is `127.0.0.1:9898`.
### checkpoint_distance
`checkpoint_distance` is the amount of incoming WAL that is held in
the open layer, before it's flushed to local disk. It puts an upper
bound on how much WAL needs to be re-processed after a pageserver
crash. It is a soft limit, the pageserver can momentarily go above it,
but it will trigger a checkpoint operation to get it back below the
limit.
`checkpoint_distance` also determines how much WAL needs to be kept
durable in the safekeeper. The safekeeper must have capacity to hold
this much WAL, with some headroom, otherwise you can get stuck in a
situation where the safekeeper is full and stops accepting new WAL,
but the pageserver is not flushing out and releasing the space in the
safekeeper because it hasn't reached checkpoint_distance yet.
`checkpoint_distance` also controls how often the WAL is uploaded to
S3.
The unit is # of bytes.
### checkpoint_period
The pageserver checks whether `checkpoint_distance` has been reached
every `checkpoint_period` seconds. Default is 1 s, which should be
fine.
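As a rough sketch of how these two settings interact (the names below are made
up for illustration and do not match the pageserver's actual code):

```rust
// Run once per `checkpoint_period`; returns true when the open
// in-memory layer covers more WAL than `checkpoint_distance`.
// Illustrative only; the real check lives inside the pageserver.
fn should_checkpoint(open_layer_start_lsn: u64, last_record_lsn: u64, checkpoint_distance: u64) -> bool {
    last_record_lsn.saturating_sub(open_layer_start_lsn) >= checkpoint_distance
}
```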
### gc_horizon
`gc_horizon` determines how much history is retained, to allow
branching and read replicas at an older point in time. The unit is #
of bytes of WAL. Page versions older than this are garbage collected
away.
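As a rough sketch (again with made-up names, not the actual implementation),
the retention cutoff implied by `gc_horizon` is simply:

```rust
// Page versions at LSNs older than the returned cutoff are candidates
// for garbage collection. Illustration only.
fn gc_cutoff_lsn(last_record_lsn: u64, gc_horizon: u64) -> u64 {
    last_record_lsn.saturating_sub(gc_horizon)
}
```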
### gc_period
Interval at which garbage collection is triggered. Default is 100 s.
### superuser
Name of the initial superuser role, passed to initdb when a new tenant
is initialized. It doesn't affect anything after initialization. The
default is 'zenith_admin'. Note: the console depends on that default,
so if you change it, bad things will happen.
### page_cache_size
Size of the page cache, to hold materialized page versions. Unit is
number of 8 kB blocks. The default is 8192, which means 64 MB.
### max_file_descriptors
Max number of file descriptors to hold open concurrently for accessing
layer files. This should be kept well below the process/container/OS
limit (see `ulimit -n`), as the pageserver also needs file descriptors
for other files and for sockets for incoming connections.
### postgres-distrib
A directory with Postgres installation to use during pageserver activities.
Inside that dir, a `bin/postgres` binary should be present.
The default distrib dir is `./tmp_install/`.
### workdir (-D)
A directory in the file system, where pageserver will store its files.
The default is `./.zenith/`.
### Remote storage
There's a way to automatically back up and restore some of the pageserver's data from the working dir to remote storage.
The backup system is disabled by default and can be enabled for either of the currently available storages:
#### Local FS storage
##### remote-storage-local-path
Pageserver can back up and restore some of its workdir contents to another directory.
For that, only a path to that directory needs to be specified as a parameter.
#### S3 storage
Pageserver can back up and restore some of its workdir contents to S3.
A full set of S3 credentials is needed for that, passed as parameters:
##### remote-storage-s3-bucket
Name of the bucket to connect to, example: "some-sample-bucket".
##### remote-storage-region
Name of the region where the bucket is located, example: "eu-north-1"
##### remote-storage-access-key
Access key to connect to the bucket ("login" part of the credentials), example: "AKIAIOSFODNN7EXAMPLE"
##### remote-storage-secret-access-key
Secret access key to connect to the bucket ("password" part of the credentials), example: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
#### General remote storage configuration
Pageserver allows only one remote storage to be configured at a time and errors if parameters from multiple different remote configurations are used.
No default values are used for the remote storage configuration parameters.
##### remote-storage-max-concurrent-sync
Max number of concurrent connections to open for uploading to or
downloading from S3.
The default value is 100.
## safekeeper
TODO

View File

@@ -79,3 +79,61 @@ Helpers for exposing Prometheus metrics from the server.
`/zenith_utils`:
Helpers that are shared between other crates in this repository.
## Using Python
Note that Debian/Ubuntu Python packages are commonly stale,
so installing dependencies via the system package manager is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version)
- Later version (e.g. 3.8) is ok if you don't write Python code
- You can install Python 3.7 separately, e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
```
- Install `pipenv`
- Exact version of `pipenv` is not important, you can use Debian/Ubuntu package `pipenv`.
- Install dependencies via either
* `pipenv --python 3.7 install --dev` if you will write Python code, or
* `pipenv install` if you only want to run Python scripts and don't have Python 3.7.
Run `pipenv shell` to activate the virtual environment.
Alternatively, use `pipenv run` to run a single command in the venv, e.g. `pipenv run pytest`.
### Obligatory checks
We force code formatting via `yapf` and type hints via `mypy`.
Run the following commands in the repository's root (next to `setup.cfg`):
```bash
pipenv run yapf -ri . # All code is reformatted
pipenv run mypy . # Ensure there are no typing errors
```
**WARNING**: do not run `mypy` from a directory other than the root of the repository.
Otherwise it will not find its configuration.
Also consider:
* Running `flake8` (or a linter of your choice, e.g. `pycodestyle`) and fixing possible defects, if any.
* Adding more type hints to your code to avoid `Any`.
### Changing dependencies
You have to update `Pipfile.lock` if you have changed `Pipfile`:
```bash
pipenv --python 3.7 install --dev # Re-create venv for Python 3.7 and install recent pipenv inside
pipenv run pipenv --version # Should be at least 2021.5.29
pipenv run pipenv lock # Regenerate Pipfile.lock
```
As the minimal supported version is Python 3.7 and we use it in CI,
you have to use a Python 3.7 environment when updating `Pipfile.lock`.
Otherwise some back-compatibility packages will be missing.
It is also important to run a recent `pipenv`.
Older versions remove markers from `Pipfile.lock`.

View File

@@ -4,10 +4,8 @@ version = "0.1.0"
authors = ["Stas Kelvich <stas@zenith.tech>"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
bookfile = "^0.3"
bookfile = { git = "https://github.com/zenithdb/bookfile.git", branch="generic-readext" }
chrono = "0.4.19"
rand = "0.8.3"
regex = "1.4.5"
@@ -16,14 +14,10 @@ byteorder = "1.4.3"
futures = "0.3.13"
hyper = "0.14"
lazy_static = "1.4.0"
slog-stdlog = "4.1.0"
slog-scope = "4.4.0"
slog-term = "2.8.0"
slog = "2.7.0"
log = "0.4.14"
clap = "2.33.0"
daemonize = "0.4.1"
tokio = { version = "1.5.0", features = ["full"] }
tokio = { version = "1.11", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
@@ -38,8 +32,21 @@ serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
toml = "0.5"
scopeguard = "1.1.0"
async-trait = "0.1"
const_format = "0.2.21"
tracing = "0.1.27"
signal-hook = "0.3.10"
url = "2"
nix = "0.23"
once_cell = "1.8.0"
rust-s3 = { version = "0.27.0-rc4", features = ["no-verify-ssl"] }
postgres_ffi = { path = "../postgres_ffi" }
zenith_metrics = { path = "../zenith_metrics" }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { path = "../workspace_hack" }
[dev-dependencies]
hex-literal = "0.3"
tempfile = "3.2"

View File

@@ -7,8 +7,9 @@ The Page Server has a few different duties:
- Replay WAL that's applicable to the chunks that the Page Server maintains
- Backup to S3
S3 is the main fault-tolerant storage of all data, as there are no Page Server
replicas. We use a separate fault-tolerant WAL service to reduce latency. It
keeps track of WAL records which are not synced to S3 yet.
The Page Server consists of multiple threads that operate on a shared
repository of page versions:
@@ -40,7 +41,7 @@ Legend:
+--+
....
. . Component that we will need, but doesn't exist at the moment. A TODO.
. . Component at its early development phase.
....
---> Data flow
@@ -115,13 +116,49 @@ Remove old on-disk layer files that are no longer needed according to the
PITR retention policy
TODO: Backup service
--------------------
### Backup service
The backup service is responsible for periodically pushing the chunks to S3.
The backup service is responsible for storing pageserver recovery data externally.
TODO: How/when do restore from S3? Whenever we get a GetPage@LSN request for
a chunk we don't currently have? Or when an external Control Plane tells us?
Currently, pageserver stores its files in a filesystem directory it's pointed to.
That working directory could be rather ephemeral for such cases as "a pageserver pod running in k8s with no persistent volumes attached".
Therefore, the server interacts with external, more reliable storage to back up and restore its state.
The code for storage support is extensible and can accommodate arbitrary backends as long as they implement a certain Rust trait (a rough sketch follows below).
There are the following implementations present:
* local filesystem — to use in tests mainly
* AWS S3 - to use in production
Implementation details are covered in the [backup readme](./src/remote_storage/README.md) and corresponding Rust file docs.
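A rough sketch of that extension point (the trait name and method signatures below are invented for illustration; the real trait in the remote_storage module is shaped differently):

```rust
use std::path::Path;

/// Illustrative sketch of a pluggable storage backend; not the actual trait.
trait RemoteStorageBackend {
    /// Upload a local file under the given remote name.
    fn upload(&self, local: &Path, remote_name: &str) -> anyhow::Result<()>;
    /// Download a remote object into the given local path.
    fn download(&self, remote_name: &str, local: &Path) -> anyhow::Result<()>;
    /// List the names of stored remote objects.
    fn list(&self) -> anyhow::Result<Vec<String>>;
}
```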
The backup service is disabled by default and can be enabled to interact with a single remote storage.
CLI examples:
* Local FS: `${PAGESERVER_BIN} --remote-storage-local-path="/some/local/path/"`
* AWS S3 : `${PAGESERVER_BIN} --remote-storage-s3-bucket="some-sample-bucket" --remote-storage-region="eu-north-1" --remote-storage-access-key="SOMEKEYAAAAASADSAH*#" --remote-storage-secret-access-key="SOMEsEcReTsd292v"`
For Amazon AWS S3, the key id and secret access key can be found in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, or on the AWS settings page for the user. Also note that the bucket name does not include a protocol prefix when used on AWS.
For local S3 installations, refer to their documentation for the name format and credentials.
Similar to other pageserver settings, a toml config file can be used to configure either of the storages as a backup target.
Required sections are:
```toml
[remote_storage]
local_path = '/Users/someonetoignore/Downloads/tmp_dir/'
```
or
```toml
[remote_storage]
bucket_name = 'some-sample-bucket'
bucket_region = 'eu-north-1'
access_key_id = 'SOMEKEYAAAAASADSAH*#'
secret_access_key = 'SOMEsEcReTsd292v'
```
Also, the `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` environment variables can be used to specify the credentials instead of any of the ways above.
TODO: Sharding
--------------------

View File

@@ -10,8 +10,10 @@
//! This module is responsible for creation of such tarball
//! from data stored in object storage.
//!
use anyhow::{Context, Result};
use bytes::{BufMut, BytesMut};
use log::*;
use std::fmt::Write as FmtWrite;
use std::io;
use std::io::Write;
use std::sync::Arc;
@@ -22,7 +24,7 @@ use crate::relish::*;
use crate::repository::Timeline;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::*;
use zenith_utils::lsn::{Lsn, RecordLsn};
use zenith_utils::lsn::Lsn;
/// This is short-living object only for the time of tarball creation,
/// created mostly to avoid passing a lot of parameters between various functions
@@ -30,7 +32,7 @@ use zenith_utils::lsn::{Lsn, RecordLsn};
pub struct Basebackup<'a> {
ar: Builder<&'a mut dyn Write>,
timeline: &'a Arc<dyn Timeline>,
lsn: Lsn,
pub lsn: Lsn,
prev_record_lsn: Lsn,
}
@@ -46,53 +48,56 @@ impl<'a> Basebackup<'a> {
write: &'a mut dyn Write,
timeline: &'a Arc<dyn Timeline>,
req_lsn: Option<Lsn>,
) -> Basebackup<'a> {
// current_prev may be zero if we are at the start of timeline branched from old lsn
let RecordLsn {
last: current_last,
prev: current_prev,
} = timeline.get_last_record_rlsn();
// Compute postgres doesn't have any previous WAL files, but the first record that this
// postgres is going to write need to have LSN of previous record (xl_prev). So we are
// writing prev_lsn to "zenith.signal" file so that postgres can read it during the start.
// In some cases we don't know prev_lsn (branch or basebackup @old_lsn) so pass Lsn(0)
// instead and embrace the wrong xl_prev in this situations.
) -> Result<Basebackup<'a>> {
// Compute postgres doesn't have any previous WAL files, but the first
// record that it's going to write needs to include the LSN of the
// previous record (xl_prev). We include prev_record_lsn in the
// "zenith.signal" file, so that postgres can read it during startup.
//
// We don't keep full history of record boundaries in the page server,
// however, only the predecessor of the latest record on each
// timeline. So we can only provide prev_record_lsn when you take a
// base backup at the end of the timeline, i.e. at last_record_lsn.
// Even at the end of the timeline, we sometimes don't have a valid
// prev_lsn value; that happens if the timeline was just branched from
// an old LSN and it doesn't have any WAL of its own yet. We will set
// prev_lsn to Lsn(0) if we cannot provide the correct value.
let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn {
if req_lsn > current_last {
// FIXME: now wait_lsn() is inside of list_nonrels() so we don't have a way
// to get it from there. It is better to wait just here.
(Lsn(0), req_lsn)
} else if req_lsn < current_last {
// we don't know prev already. We don't currently use basebackup@old_lsn
// but may use it for read only replicas in future
(Lsn(0), req_lsn)
// Backup was requested at a particular LSN. Wait for it to arrive.
timeline.wait_lsn(req_lsn)?;
// If the requested point is the end of the timeline, we can
// provide prev_lsn. (get_last_record_rlsn() might return it as
// zero, though, if no WAL has been generated on this timeline
// yet.)
let end_of_timeline = timeline.get_last_record_rlsn();
if req_lsn == end_of_timeline.last {
(end_of_timeline.prev, req_lsn)
} else {
// we are exactly at req_lsn and know prev
(current_prev, req_lsn)
(Lsn(0), req_lsn)
}
} else {
// None in req_lsn means that we are branching from the latest LSN
(current_prev, current_last)
// Backup was requested at end of the timeline.
let end_of_timeline = timeline.get_last_record_rlsn();
(end_of_timeline.prev, end_of_timeline.last)
};
info!(
"taking basebackup lsn={}, prev_lsn={}",
backup_prev, backup_lsn
backup_lsn, backup_prev
);
Basebackup {
Ok(Basebackup {
ar: Builder::new(write),
timeline,
lsn: backup_lsn,
prev_record_lsn: backup_prev,
}
})
}
pub fn send_tarball(&mut self) -> anyhow::Result<()> {
// Create pgdata subdirs structure
for dir in pg_constants::PGDATA_SUBDIRS.iter() {
info!("send subdir {:?}", *dir);
let header = new_tar_header_dir(*dir)?;
self.ar.append(&header, &mut io::empty())?;
}
@@ -154,11 +159,9 @@ impl<'a> Basebackup<'a> {
let mut slru_buf: Vec<u8> =
Vec::with_capacity(nblocks as usize * pg_constants::BLCKSZ as usize);
for blknum in 0..nblocks {
let img = self.timeline.get_page_at_lsn_nowait(
RelishTag::Slru { slru, segno },
blknum,
self.lsn,
)?;
let img =
self.timeline
.get_page_at_lsn(RelishTag::Slru { slru, segno }, blknum, self.lsn)?;
assert!(img.len() == pg_constants::BLCKSZ as usize);
slru_buf.extend_from_slice(&img);
@@ -177,7 +180,7 @@ impl<'a> Basebackup<'a> {
// Along with them also send PG_VERSION for each database.
//
fn add_relmap_file(&mut self, spcnode: u32, dbnode: u32) -> anyhow::Result<()> {
let img = self.timeline.get_page_at_lsn_nowait(
let img = self.timeline.get_page_at_lsn(
RelishTag::FileNodeMap { spcnode, dbnode },
0,
self.lsn,
@@ -219,7 +222,7 @@ impl<'a> Basebackup<'a> {
fn add_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
let img = self
.timeline
.get_page_at_lsn_nowait(RelishTag::TwoPhase { xid }, 0, self.lsn)?;
.get_page_at_lsn(RelishTag::TwoPhase { xid }, 0, self.lsn)?;
let mut buf = BytesMut::new();
buf.extend_from_slice(&img[..]);
@@ -237,22 +240,18 @@ impl<'a> Basebackup<'a> {
// Also send zenith.signal file with extra bootstrap data.
//
fn add_pgcontrol_file(&mut self) -> anyhow::Result<()> {
let checkpoint_bytes =
self.timeline
.get_page_at_lsn_nowait(RelishTag::Checkpoint, 0, self.lsn)?;
let pg_control_bytes =
self.timeline
.get_page_at_lsn_nowait(RelishTag::ControlFile, 0, self.lsn)?;
let checkpoint_bytes = self
.timeline
.get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn)
.context("failed to get checkpoint bytes")?;
let pg_control_bytes = self
.timeline
.get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn)
.context("failed get control bytes")?;
let mut pg_control = ControlFileData::decode(&pg_control_bytes)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
// Generate new pg_control and WAL needed for bootstrap
let checkpoint_segno = self.lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let checkpoint_lsn = XLogSegNoOffsetToRecPtr(
checkpoint_segno,
XLOG_SIZE_OF_XLOG_LONG_PHD as u32,
pg_constants::WAL_SEGMENT_SIZE,
);
// Generate new pg_control needed for bootstrap
checkpoint.redo = normalize_lsn(self.lsn, pg_constants::WAL_SEGMENT_SIZE).0;
//reset some fields we don't want to preserve
@@ -261,19 +260,24 @@ impl<'a> Basebackup<'a> {
checkpoint.oldestActiveXid = 0;
//save new values in pg_control
pg_control.checkPoint = checkpoint_lsn;
pg_control.checkPoint = 0;
pg_control.checkPointCopy = checkpoint;
pg_control.state = pg_constants::DB_SHUTDOWNED;
// add zenith.signal file
let xl_prev = if self.prev_record_lsn == Lsn(0) {
0xBAD0 // magic value to indicate that we don't know prev_lsn
let mut zenith_signal = String::new();
if self.prev_record_lsn == Lsn(0) {
if self.lsn == self.timeline.get_ancestor_lsn() {
write!(zenith_signal, "PREV LSN: none")?;
} else {
write!(zenith_signal, "PREV LSN: invalid")?;
}
} else {
self.prev_record_lsn.0
};
write!(zenith_signal, "PREV LSN: {}", self.prev_record_lsn)?;
}
self.ar.append(
&new_tar_header("zenith.signal", 8)?,
&xl_prev.to_le_bytes()[..],
&new_tar_header("zenith.signal", zenith_signal.len() as u64)?,
zenith_signal.as_bytes(),
)?;
//send pg_control
@@ -282,14 +286,11 @@ impl<'a> Basebackup<'a> {
self.ar.append(&header, &pg_control_bytes[..])?;
//send wal segment
let wal_file_name = XLogFileName(
1, // FIXME: always use Postgres timeline 1
checkpoint_segno,
pg_constants::WAL_SEGMENT_SIZE,
);
let segno = self.lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let wal_file_name = XLogFileName(PG_TLI, segno, pg_constants::WAL_SEGMENT_SIZE);
let wal_file_path = format!("pg_wal/{}", wal_file_name);
let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?;
let wal_seg = generate_wal_segment(&pg_control);
let wal_seg = generate_wal_segment(segno, pg_control.system_identifier);
assert!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE);
self.ar.append(&header, &wal_seg[..])?;
Ok(())

View File

@@ -4,11 +4,14 @@
use anyhow::Result;
use clap::{App, Arg};
use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::virtual_file;
use std::path::PathBuf;
use zenith_utils::GIT_VERSION;
fn main() -> Result<()> {
let arg_matches = App::new("Zenith dump_layerfile utility")
.about("Dump contents of one layer file, for debugging")
.version(GIT_VERSION)
.arg(
Arg::with_name("path")
.help("Path to file to dump")
@@ -19,6 +22,9 @@ fn main() -> Result<()> {
let path = PathBuf::from(arg_matches.value_of("path").unwrap());
// Basic initialization of things that don't change after startup
virtual_file::init(10);
dump_layerfile_from_path(&path)?;
Ok(())

View File

@@ -2,44 +2,76 @@
// Main entry point for the Page Server executable
//
use log::*;
use serde::{Deserialize, Serialize};
use std::{
env,
net::TcpListener,
num::{NonZeroU32, NonZeroUsize},
path::{Path, PathBuf},
process::exit,
str::FromStr,
thread,
time::Duration,
};
use zenith_utils::{auth::JwtAuth, postgres_backend::AuthType};
use tracing::*;
use zenith_utils::{auth::JwtAuth, logging, postgres_backend::AuthType, tcp_listener, GIT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use anyhow::{ensure, Result};
use clap::{App, Arg, ArgMatches};
use daemonize::Daemonize;
use pageserver::{branches, http, logger, page_service, tenant_mgr, PageServerConf};
use pageserver::{
branches, defaults::*, http, page_cache, page_service, remote_storage, tenant_mgr,
virtual_file, PageServerConf, RemoteStorageConfig, RemoteStorageKind, S3Config, LOG_FILE_NAME,
};
use zenith_utils::http::endpoint;
use zenith_utils::postgres_backend;
use zenith_utils::shutdown::exit_now;
use zenith_utils::signals::{self, Signal};
const DEFAULT_LISTEN_ADDR: &str = "127.0.0.1:64000";
const DEFAULT_HTTP_ENDPOINT_ADDR: &str = "127.0.0.1:9898";
const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
const DEFAULT_GC_PERIOD: Duration = Duration::from_secs(10);
const DEFAULT_SUPERUSER: &str = "zenith_admin";
use const_format::formatcp;
/// String arguments that can be declared via CLI or config file
#[derive(Serialize, Deserialize)]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone)]
struct CfgFileParams {
listen_addr: Option<String>,
http_endpoint_addr: Option<String>,
listen_pg_addr: Option<String>,
listen_http_addr: Option<String>,
checkpoint_distance: Option<String>,
checkpoint_period: Option<String>,
gc_horizon: Option<String>,
gc_period: Option<String>,
open_mem_limit: Option<String>,
page_cache_size: Option<String>,
max_file_descriptors: Option<String>,
pg_distrib_dir: Option<String>,
auth_validation_public_key_path: Option<String>,
auth_type: Option<String>,
remote_storage_max_concurrent_sync: Option<String>,
remote_storage_max_sync_errors: Option<String>,
/////////////////////////////////
//// Don't put `Option<String>` and other "simple" values below.
////
/// `Option<RemoteStorage>` is a <a href='https://toml.io/en/v1.0.0#table'>table</a> in TOML.
/// Values in TOML cannot be defined after tables (other tables can),
/// and [`toml`] crate serializes all fields in the order of their appearance.
////////////////////////////////
remote_storage: Option<RemoteStorage>,
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone)]
// Without this attribute, enums with values won't be serialized by the `toml` library (but can be deserialized nonetheless!).
// See https://github.com/alexcrichton/toml-rs/blob/6c162e6562c3e432bf04c82a3d1d789d80761a86/examples/enum_external.rs for the examples
#[serde(untagged)]
enum RemoteStorage {
Local {
local_path: String,
},
AwsS3 {
bucket_name: String,
bucket_region: String,
#[serde(skip_serializing)]
access_key_id: Option<String>,
#[serde(skip_serializing)]
secret_access_key: Option<String>,
},
}
impl CfgFileParams {
@@ -49,14 +81,37 @@ impl CfgFileParams {
arg_matches.value_of(arg_name).map(str::to_owned)
};
let remote_storage = if let Some(local_path) = get_arg("remote-storage-local-path") {
Some(RemoteStorage::Local { local_path })
} else if let Some((bucket_name, bucket_region)) =
get_arg("remote-storage-s3-bucket").zip(get_arg("remote-storage-region"))
{
Some(RemoteStorage::AwsS3 {
bucket_name,
bucket_region,
access_key_id: get_arg("remote-storage-access-key"),
secret_access_key: get_arg("remote-storage-secret-access-key"),
})
} else {
None
};
Self {
listen_addr: get_arg("listen"),
http_endpoint_addr: get_arg("http_endpoint"),
listen_pg_addr: get_arg("listen_pg_addr"),
listen_http_addr: get_arg("listen_http_addr"),
checkpoint_distance: get_arg("checkpoint_distance"),
checkpoint_period: get_arg("checkpoint_period"),
gc_horizon: get_arg("gc_horizon"),
gc_period: get_arg("gc_period"),
open_mem_limit: get_arg("open_mem_limit"),
page_cache_size: get_arg("page_cache_size"),
max_file_descriptors: get_arg("max_file_descriptors"),
pg_distrib_dir: get_arg("postgres-distrib"),
auth_validation_public_key_path: get_arg("auth-validation-public-key-path"),
auth_type: get_arg("auth-type"),
remote_storage,
remote_storage_max_concurrent_sync: get_arg("remote-storage-max-concurrent-sync"),
remote_storage_max_sync_errors: get_arg("remote-storage-max-sync-errors"),
}
}
@@ -64,15 +119,27 @@ impl CfgFileParams {
fn or(self, other: CfgFileParams) -> Self {
// TODO cleaner way to do this
Self {
listen_addr: self.listen_addr.or(other.listen_addr),
http_endpoint_addr: self.http_endpoint_addr.or(other.http_endpoint_addr),
listen_pg_addr: self.listen_pg_addr.or(other.listen_pg_addr),
listen_http_addr: self.listen_http_addr.or(other.listen_http_addr),
checkpoint_distance: self.checkpoint_distance.or(other.checkpoint_distance),
checkpoint_period: self.checkpoint_period.or(other.checkpoint_period),
gc_horizon: self.gc_horizon.or(other.gc_horizon),
gc_period: self.gc_period.or(other.gc_period),
open_mem_limit: self.open_mem_limit.or(other.open_mem_limit),
page_cache_size: self.page_cache_size.or(other.page_cache_size),
max_file_descriptors: self.max_file_descriptors.or(other.max_file_descriptors),
pg_distrib_dir: self.pg_distrib_dir.or(other.pg_distrib_dir),
auth_validation_public_key_path: self
.auth_validation_public_key_path
.or(other.auth_validation_public_key_path),
auth_type: self.auth_type.or(other.auth_type),
remote_storage: self.remote_storage.or(other.remote_storage),
remote_storage_max_concurrent_sync: self
.remote_storage_max_concurrent_sync
.or(other.remote_storage_max_concurrent_sync),
remote_storage_max_sync_errors: self
.remote_storage_max_sync_errors
.or(other.remote_storage_max_sync_errors),
}
}
@@ -80,14 +147,23 @@ impl CfgFileParams {
fn try_into_config(&self) -> Result<PageServerConf> {
let workdir = PathBuf::from(".");
let listen_addr = match self.listen_addr.as_ref() {
let listen_pg_addr = match self.listen_pg_addr.as_ref() {
Some(addr) => addr.clone(),
None => DEFAULT_LISTEN_ADDR.to_owned(),
None => DEFAULT_PG_LISTEN_ADDR.to_owned(),
};
let http_endpoint_addr = match self.http_endpoint_addr.as_ref() {
let listen_http_addr = match self.listen_http_addr.as_ref() {
Some(addr) => addr.clone(),
None => DEFAULT_HTTP_ENDPOINT_ADDR.to_owned(),
None => DEFAULT_HTTP_LISTEN_ADDR.to_owned(),
};
let checkpoint_distance: u64 = match self.checkpoint_distance.as_ref() {
Some(checkpoint_distance_str) => checkpoint_distance_str.parse()?,
None => DEFAULT_CHECKPOINT_DISTANCE,
};
let checkpoint_period = match self.checkpoint_period.as_ref() {
Some(checkpoint_period_str) => humantime::parse_duration(checkpoint_period_str)?,
None => DEFAULT_CHECKPOINT_PERIOD,
};
let gc_horizon: u64 = match self.gc_horizon.as_ref() {
@@ -99,6 +175,21 @@ impl CfgFileParams {
None => DEFAULT_GC_PERIOD,
};
let open_mem_limit: usize = match self.open_mem_limit.as_ref() {
Some(open_mem_limit_str) => open_mem_limit_str.parse()?,
None => DEFAULT_OPEN_MEM_LIMIT,
};
let page_cache_size: usize = match self.page_cache_size.as_ref() {
Some(page_cache_size_str) => page_cache_size_str.parse()?,
None => DEFAULT_PAGE_CACHE_SIZE,
};
let max_file_descriptors: usize = match self.max_file_descriptors.as_ref() {
Some(max_file_descriptors_str) => max_file_descriptors_str.parse()?,
None => DEFAULT_MAX_FILE_DESCRIPTORS,
};
let pg_distrib_dir = match self.pg_distrib_dir.as_ref() {
Some(pg_distrib_dir_str) => PathBuf::from(pg_distrib_dir_str),
None => env::current_dir()?.join("tmp_install"),
@@ -117,7 +208,7 @@ impl CfgFileParams {
})?;
if !pg_distrib_dir.join("bin/postgres").exists() {
anyhow::bail!("Can't find postgres binary at {:?}", pg_distrib_dir);
bail!("Can't find postgres binary at {:?}", pg_distrib_dir);
}
if auth_type == AuthType::ZenithJWT {
@@ -132,13 +223,50 @@ impl CfgFileParams {
);
}
let max_concurrent_sync = match self.remote_storage_max_concurrent_sync.as_deref() {
Some(number_str) => number_str.parse()?,
None => NonZeroUsize::new(DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC).unwrap(),
};
let max_sync_errors = match self.remote_storage_max_sync_errors.as_deref() {
Some(number_str) => number_str.parse()?,
None => NonZeroU32::new(DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS).unwrap(),
};
let remote_storage_config = self.remote_storage.as_ref().map(|storage_params| {
let storage = match storage_params.clone() {
RemoteStorage::Local { local_path } => {
RemoteStorageKind::LocalFs(PathBuf::from(local_path))
}
RemoteStorage::AwsS3 {
bucket_name,
bucket_region,
access_key_id,
secret_access_key,
} => RemoteStorageKind::AwsS3(S3Config {
bucket_name,
bucket_region,
access_key_id,
secret_access_key,
}),
};
RemoteStorageConfig {
max_concurrent_sync,
max_sync_errors,
storage,
}
});
Ok(PageServerConf {
daemonize: false,
listen_addr,
http_endpoint_addr,
listen_pg_addr,
listen_http_addr,
checkpoint_distance,
checkpoint_period,
gc_horizon,
gc_period,
open_mem_limit,
page_cache_size,
max_file_descriptors,
superuser: String::from(DEFAULT_SUPERUSER),
@@ -148,19 +276,30 @@ impl CfgFileParams {
auth_validation_public_key_path,
auth_type,
remote_storage_config,
})
}
}
fn main() -> Result<()> {
zenith_metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(GIT_VERSION)
.arg(
Arg::with_name("listen")
Arg::with_name("listen_pg_addr")
.short("l")
.long("listen")
.long("listen_pg_addr")
.aliases(&["listen", "listen-pg"]) // keep some compatibility
.takes_value(true)
.help("listen for incoming page requests on ip:port (default: 127.0.0.1:5430)"),
.help(formatcp!("listen for incoming page requests on ip:port (default: {DEFAULT_PG_LISTEN_ADDR})")),
)
.arg(
Arg::with_name("listen_http_addr")
.long("listen_http_addr")
.aliases(&["http_endpoint", "listen-http"]) // keep some compatibility
.takes_value(true)
.help(formatcp!("http endpoint address for metrics and management API calls on ip:port (default: {DEFAULT_HTTP_LISTEN_ADDR})")),
)
.arg(
Arg::with_name("daemonize")
@@ -175,6 +314,18 @@ fn main() -> Result<()> {
.takes_value(false)
.help("Initialize pageserver repo"),
)
.arg(
Arg::with_name("checkpoint_distance")
.long("checkpoint_distance")
.takes_value(true)
.help("Distance from current LSN to perform checkpoint of in-memory layers"),
)
.arg(
Arg::with_name("checkpoint_period")
.long("checkpoint_period")
.takes_value(true)
.help("Interval between checkpoint iterations"),
)
.arg(
Arg::with_name("gc_horizon")
.long("gc_horizon")
@@ -187,6 +338,25 @@ fn main() -> Result<()> {
.takes_value(true)
.help("Interval between garbage collector iterations"),
)
.arg(
Arg::with_name("open_mem_limit")
.long("open_mem_limit")
.takes_value(true)
.help("Amount of memory reserved for buffering incoming WAL"),
)
.arg(
Arg::with_name("page_cache_size")
.long("page_cache_size")
.takes_value(true)
.help("Number of pages in the page cache"),
)
.arg(
Arg::with_name("max_file_descriptors")
.long("max_file_descriptors")
.takes_value(true)
.help("Max number of file descriptors to keep open for files"),
)
.arg(
Arg::with_name("workdir")
.short("D")
@@ -219,10 +389,56 @@ fn main() -> Result<()> {
.takes_value(true)
.help("Authentication scheme type. One of: Trust, MD5, ZenithJWT"),
)
.arg(
Arg::with_name("remote-storage-local-path")
.long("remote-storage-local-path")
.takes_value(true)
.help("Path to the local directory, to be used as an external remote storage")
.conflicts_with_all(&[
"remote-storage-s3-bucket",
"remote-storage-region",
"remote-storage-access-key",
"remote-storage-secret-access-key",
]),
)
.arg(
Arg::with_name("remote-storage-s3-bucket")
.long("remote-storage-s3-bucket")
.takes_value(true)
.help("Name of the AWS S3 bucket to use an external remote storage")
.requires("remote-storage-region"),
)
.arg(
Arg::with_name("remote-storage-region")
.long("remote-storage-region")
.takes_value(true)
.help("Region of the AWS S3 bucket"),
)
.arg(
Arg::with_name("remote-storage-access-key")
.long("remote-storage-access-key")
.takes_value(true)
.help("Credentials to access the AWS S3 bucket"),
)
.arg(
Arg::with_name("remote-storage-secret-access-key")
.long("remote-storage-secret-access-key")
.takes_value(true)
.help("Credentials to access the AWS S3 bucket"),
)
.arg(
Arg::with_name("remote-storage-max-concurrent-sync")
.long("remote-storage-max-concurrent-sync")
.takes_value(true)
.help("Maximum allowed concurrent synchronisations with storage"),
)
.get_matches();
let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".zenith"));
let cfg_file_path = workdir.canonicalize()?.join("pageserver.toml");
let cfg_file_path = workdir
.canonicalize()
.with_context(|| format!("Error opening workdir '{}'", workdir.display()))?
.join("pageserver.toml");
let args_params = CfgFileParams::from_args(&arg_matches);
@@ -234,22 +450,37 @@ fn main() -> Result<()> {
args_params
} else {
// Supplement the CLI arguments with the config file
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)?;
let file_params: CfgFileParams = toml::from_str(&cfg_file_contents)?;
let cfg_file_contents = std::fs::read_to_string(&cfg_file_path)
.with_context(|| format!("No pageserver config at '{}'", cfg_file_path.display()))?;
let file_params: CfgFileParams = toml::from_str(&cfg_file_contents).with_context(|| {
format!(
"Failed to read '{}' as pageserver config",
cfg_file_path.display()
)
})?;
args_params.or(file_params)
};
// Set CWD to workdir for non-daemon modes
env::set_current_dir(&workdir)?;
env::set_current_dir(&workdir).with_context(|| {
format!(
"Failed to set application's current dir to '{}'",
workdir.display()
)
})?;
// Ensure the config is valid, even if just init-ing
let mut conf = params.try_into_config()?;
let mut conf = params.try_into_config().with_context(|| {
format!(
"Pageserver config at '{}' is not valid",
cfg_file_path.display()
)
})?;
conf.daemonize = arg_matches.is_present("daemonize");
if init && conf.daemonize {
eprintln!("--daemonize cannot be used with --init");
exit(1);
bail!("--daemonize cannot be used with --init")
}
// The configuration is all set up now. Turn it into a 'static
@@ -257,30 +488,52 @@ fn main() -> Result<()> {
// as a ref.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
page_cache::init(conf);
// Create repo and exit if init was requested
if init {
branches::init_pageserver(conf, create_tenant)?;
branches::init_pageserver(conf, create_tenant).context("Failed to init pageserver")?;
// write the config file
let cfg_file_contents = toml::to_string_pretty(&params)?;
let cfg_file_contents = toml::to_string_pretty(&params)
.context("Failed to create pageserver config contents for initialisation")?;
// TODO support enable-auth flag
std::fs::write(&cfg_file_path, cfg_file_contents)?;
return Ok(());
std::fs::write(&cfg_file_path, cfg_file_contents).with_context(|| {
format!(
"Failed to initialize pageserver config at '{}'",
cfg_file_path.display()
)
})?;
Ok(())
} else {
start_pageserver(conf).context("Failed to start pageserver")
}
start_pageserver(conf)
}
fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
// Initialize logger
let (_scope_guard, log_file) = logger::init_logging(conf, "pageserver.log")?;
let _log_guard = slog_stdlog::init()?;
let log_file = logging::init(LOG_FILE_NAME, conf.daemonize)?;
// Note: this `info!(...)` macro comes from `log` crate
info!("standard logging redirected to slog");
info!("version: {}", GIT_VERSION);
// TODO: Check that it looks like a valid repository before going further
// bind sockets before daemonizing so we report errors early and do not return until we are listening
info!(
"Starting pageserver http handler on {}",
conf.listen_http_addr
);
let http_listener = tcp_listener::bind(conf.listen_http_addr.clone())?;
info!(
"Starting pageserver pg protocol handler on {}",
conf.listen_pg_addr
);
let pageserver_listener = tcp_listener::bind(conf.listen_pg_addr.clone())?;
// XXX: Don't spawn any threads before daemonizing!
if conf.daemonize {
info!("daemonizing...");
@@ -295,12 +548,25 @@ fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
.stdout(stdout)
.stderr(stderr);
match daemonize.start() {
// XXX: The parent process should exit abruptly right after
// it has spawned a child to prevent coverage machinery from
// dumping stats into a `profraw` file now owned by the child.
// Otherwise, the coverage data will be damaged.
match daemonize.exit_action(|| exit_now(0)).start() {
Ok(_) => info!("Success, daemonized"),
Err(e) => error!("Error, {}", e),
Err(err) => error!(%err, "could not daemonize"),
}
}
let signals = signals::install_shutdown_handlers()?;
let mut threads = vec![];
if let Some(handle) = remote_storage::run_storage_sync_thread(conf)? {
threads.push(handle);
}
// Initialize tenant manager.
tenant_mgr::init(conf);
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
AuthType::Trust | AuthType::MD5 => None,
@@ -313,33 +579,206 @@ fn start_pageserver(conf: &'static PageServerConf) -> Result<()> {
info!("Using auth: {:#?}", conf.auth_type);
// Spawn a new thread for the http endpoint
// bind before launching separate thread so the error reported before startup exits
let cloned = auth.clone();
thread::Builder::new()
.name("http_endpoint_thread".into())
.spawn(move || {
let router = http::make_router(conf, cloned);
endpoint::serve_thread_main(router, conf.http_endpoint_addr.clone())
})?;
// Check that we can bind to address before starting threads to simplify shutdown
// sequence if port is occupied.
info!("Starting pageserver on {}", conf.listen_addr);
let pageserver_listener = TcpListener::bind(conf.listen_addr.clone())?;
// Initialize tenant manager.
tenant_mgr::init(conf);
threads.push(
thread::Builder::new()
.name("http_endpoint_thread".into())
.spawn(move || {
let router = http::make_router(conf, cloned);
endpoint::serve_thread_main(router, http_listener)
})?,
);
// Spawn a thread to listen for connections. It will spawn further threads
// for each connection.
let page_service_thread = thread::Builder::new()
.name("Page Service thread".into())
.spawn(move || {
page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type)
})?;
threads.push(
thread::Builder::new()
.name("Page Service thread".into())
.spawn(move || {
page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type)
})?,
);
page_service_thread
.join()
.expect("Page service thread has panicked")?;
signals.handle(|signal| match signal {
Signal::Quit => {
info!(
"Got {}. Terminating in immediate shutdown mode",
signal.name()
);
std::process::exit(111);
}
Ok(())
Signal::Interrupt | Signal::Terminate => {
info!(
"Got {}. Terminating gracefully in fast shutdown mode",
signal.name()
);
postgres_backend::set_pgbackend_shutdown_requested();
tenant_mgr::shutdown_all_tenants()?;
endpoint::shutdown();
for handle in std::mem::take(&mut threads) {
handle
.join()
.expect("thread panicked")
.expect("thread exited with an error");
}
info!("Shut down successfully completed");
std::process::exit(0);
}
})
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn page_server_conf_toml_serde() {
let params = CfgFileParams {
listen_pg_addr: Some("listen_pg_addr_VALUE".to_string()),
listen_http_addr: Some("listen_http_addr_VALUE".to_string()),
checkpoint_distance: Some("checkpoint_distance_VALUE".to_string()),
checkpoint_period: Some("checkpoint_period_VALUE".to_string()),
gc_horizon: Some("gc_horizon_VALUE".to_string()),
gc_period: Some("gc_period_VALUE".to_string()),
open_mem_limit: Some("open_mem_limit_VALUE".to_string()),
page_cache_size: Some("page_cache_size_VALUE".to_string()),
max_file_descriptors: Some("max_file_descriptors_VALUE".to_string()),
pg_distrib_dir: Some("pg_distrib_dir_VALUE".to_string()),
auth_validation_public_key_path: Some(
"auth_validation_public_key_path_VALUE".to_string(),
),
auth_type: Some("auth_type_VALUE".to_string()),
remote_storage: Some(RemoteStorage::Local {
local_path: "remote_storage_local_VALUE".to_string(),
}),
remote_storage_max_concurrent_sync: Some(
"remote_storage_max_concurrent_sync_VALUE".to_string(),
),
remote_storage_max_sync_errors: Some(
"remote_storage_max_sync_errors_VALUE".to_string(),
),
};
let toml_string = toml::to_string(&params).expect("Failed to serialize correct config");
let toml_pretty_string =
toml::to_string_pretty(&params).expect("Failed to serialize correct config");
assert_eq!(
r#"listen_pg_addr = 'listen_pg_addr_VALUE'
listen_http_addr = 'listen_http_addr_VALUE'
checkpoint_distance = 'checkpoint_distance_VALUE'
checkpoint_period = 'checkpoint_period_VALUE'
gc_horizon = 'gc_horizon_VALUE'
gc_period = 'gc_period_VALUE'
open_mem_limit = 'open_mem_limit_VALUE'
page_cache_size = 'page_cache_size_VALUE'
max_file_descriptors = 'max_file_descriptors_VALUE'
pg_distrib_dir = 'pg_distrib_dir_VALUE'
auth_validation_public_key_path = 'auth_validation_public_key_path_VALUE'
auth_type = 'auth_type_VALUE'
remote_storage_max_concurrent_sync = 'remote_storage_max_concurrent_sync_VALUE'
remote_storage_max_sync_errors = 'remote_storage_max_sync_errors_VALUE'
[remote_storage]
local_path = 'remote_storage_local_VALUE'
"#,
toml_pretty_string
);
let params_from_serialized: CfgFileParams = toml::from_str(&toml_string)
.expect("Failed to deserialize the serialization result of the config");
let params_from_serialized_pretty: CfgFileParams = toml::from_str(&toml_pretty_string)
.expect("Failed to deserialize the prettified serialization result of the config");
assert!(
params_from_serialized == params,
"Expected the same config in the end of config -> serialize -> deserialize chain"
);
assert!(
params_from_serialized_pretty == params,
"Expected the same config in the end of config -> serialize pretty -> deserialize chain"
);
}
#[test]
fn credentials_omitted_during_serialization() {
let params = CfgFileParams {
listen_pg_addr: Some("listen_pg_addr_VALUE".to_string()),
listen_http_addr: Some("listen_http_addr_VALUE".to_string()),
checkpoint_distance: Some("checkpoint_distance_VALUE".to_string()),
checkpoint_period: Some("checkpoint_period_VALUE".to_string()),
gc_horizon: Some("gc_horizon_VALUE".to_string()),
gc_period: Some("gc_period_VALUE".to_string()),
open_mem_limit: Some("open_mem_limit_VALUE".to_string()),
page_cache_size: Some("page_cache_size_VALUE".to_string()),
max_file_descriptors: Some("max_file_descriptors_VALUE".to_string()),
pg_distrib_dir: Some("pg_distrib_dir_VALUE".to_string()),
auth_validation_public_key_path: Some(
"auth_validation_public_key_path_VALUE".to_string(),
),
auth_type: Some("auth_type_VALUE".to_string()),
remote_storage: Some(RemoteStorage::AwsS3 {
bucket_name: "bucket_name_VALUE".to_string(),
bucket_region: "bucket_region_VALUE".to_string(),
access_key_id: Some("access_key_id_VALUE".to_string()),
secret_access_key: Some("secret_access_key_VALUE".to_string()),
}),
remote_storage_max_concurrent_sync: Some(
"remote_storage_max_concurrent_sync_VALUE".to_string(),
),
remote_storage_max_sync_errors: Some(
"remote_storage_max_sync_errors_VALUE".to_string(),
),
};
let toml_string = toml::to_string(&params).expect("Failed to serialize correct config");
let toml_pretty_string =
toml::to_string_pretty(&params).expect("Failed to serialize correct config");
assert_eq!(
r#"listen_pg_addr = 'listen_pg_addr_VALUE'
listen_http_addr = 'listen_http_addr_VALUE'
checkpoint_distance = 'checkpoint_distance_VALUE'
checkpoint_period = 'checkpoint_period_VALUE'
gc_horizon = 'gc_horizon_VALUE'
gc_period = 'gc_period_VALUE'
open_mem_limit = 'open_mem_limit_VALUE'
page_cache_size = 'page_cache_size_VALUE'
max_file_descriptors = 'max_file_descriptors_VALUE'
pg_distrib_dir = 'pg_distrib_dir_VALUE'
auth_validation_public_key_path = 'auth_validation_public_key_path_VALUE'
auth_type = 'auth_type_VALUE'
remote_storage_max_concurrent_sync = 'remote_storage_max_concurrent_sync_VALUE'
remote_storage_max_sync_errors = 'remote_storage_max_sync_errors_VALUE'
[remote_storage]
bucket_name = 'bucket_name_VALUE'
bucket_region = 'bucket_region_VALUE'
"#,
toml_pretty_string
);
let params_from_serialized: CfgFileParams = toml::from_str(&toml_string)
.expect("Failed to deserialize the serialization result of the config");
let params_from_serialized_pretty: CfgFileParams = toml::from_str(&toml_pretty_string)
.expect("Failed to deserialize the prettified serialization result of the config");
let mut expected_params = params;
expected_params.remote_storage = Some(RemoteStorage::AwsS3 {
bucket_name: "bucket_name_VALUE".to_string(),
bucket_region: "bucket_region_VALUE".to_string(),
access_key_id: None,
secret_access_key: None,
});
assert!(
params_from_serialized == expected_params,
"Expected the config without credentials in the end of a 'config -> serialize -> deserialize' chain"
);
assert!(
params_from_serialized_pretty == expected_params,
"Expected the config without credentials in the end of a 'config -> serialize pretty -> deserialize' chain"
);
}
}

View File

@@ -4,7 +4,7 @@
// TODO: move all paths construction to conf impl
//
use anyhow::{bail, ensure, Context, Result};
use anyhow::{bail, Context, Result};
use postgres_ffi::ControlFileData;
use serde::{Deserialize, Serialize};
use std::{
@@ -14,25 +14,75 @@ use std::{
str::FromStr,
sync::Arc,
};
use tracing::*;
use zenith_utils::crashsafe_dir;
use zenith_utils::logging;
use zenith_utils::lsn::Lsn;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use log::*;
use zenith_utils::lsn::Lsn;
use crate::logger;
use crate::restore_local_repo;
use crate::tenant_mgr;
use crate::walredo::WalRedoManager;
use crate::CheckpointConfig;
use crate::{repository::Repository, PageServerConf};
use crate::{restore_local_repo, LOG_FILE_NAME};
#[derive(Serialize, Deserialize, Clone)]
pub struct BranchInfo {
pub name: String,
#[serde(with = "hex")]
pub timeline_id: ZTimelineId,
pub latest_valid_lsn: Option<Lsn>,
pub latest_valid_lsn: Lsn,
pub ancestor_id: Option<String>,
pub ancestor_lsn: Option<String>,
pub current_logical_size: usize,
pub current_logical_size_non_incremental: Option<usize>,
}
impl BranchInfo {
pub fn from_path<T: AsRef<Path>>(
path: T,
repo: &Arc<dyn Repository>,
include_non_incremental_logical_size: bool,
) -> Result<Self> {
let name = path
.as_ref()
.file_name()
.unwrap()
.to_str()
.unwrap()
.to_string();
let timeline_id = std::fs::read_to_string(path)?.parse::<ZTimelineId>()?;
let timeline = repo.get_timeline(timeline_id)?;
// we use ancestor lsn zero if we don't have an ancestor, so turn this into an option based on timeline id
let (ancestor_id, ancestor_lsn) = match timeline.get_ancestor_timeline_id() {
Some(ancestor_id) => (
Some(ancestor_id.to_string()),
Some(timeline.get_ancestor_lsn().to_string()),
),
None => (None, None),
};
// Non-incremental size calculation can be heavy, so let it be optional;
// it is needed for tests to check size calculation
let current_logical_size_non_incremental = include_non_incremental_logical_size
.then(|| {
timeline.get_current_logical_size_non_incremental(timeline.get_last_record_lsn())
})
.transpose()?;
Ok(BranchInfo {
name,
timeline_id,
latest_valid_lsn: timeline.get_last_record_lsn(),
ancestor_id,
ancestor_lsn,
current_logical_size: timeline.get_current_logical_size(),
current_logical_size_non_incremental,
})
}
}
#[derive(Debug, Clone, Copy)]
@@ -43,8 +93,8 @@ pub struct PointInTime {
pub fn init_pageserver(conf: &'static PageServerConf, create_tenant: Option<&str>) -> Result<()> {
// Initialize logger
let (_scope_guard, _log_file) = logger::init_logging(conf, "pageserver.log")?;
let _log_guard = slog_stdlog::init()?;
// use true as daemonize parameter because otherwise we pollute zenith cli output with a few pages long output of info messages
let _log_file = logging::init(LOG_FILE_NAME, true)?;
// We don't use the real WAL redo manager, because we don't want to spawn the WAL redo
// process during repository initialization.
@@ -63,7 +113,7 @@ pub fn init_pageserver(conf: &'static PageServerConf, create_tenant: Option<&str
println!("initializing tenantid {}", tenantid);
create_repo(conf, tenantid, dummy_redo_mgr).with_context(|| "failed to create repo")?;
}
fs::create_dir_all(conf.tenants_path())?;
crashsafe_dir::create_dir_all(conf.tenants_path())?;
println!("pageserver init succeeded");
Ok(())
@@ -80,30 +130,32 @@ pub fn create_repo(
}
// top-level dir may exist if we are creating it through CLI
fs::create_dir_all(&repo_dir)
crashsafe_dir::create_dir_all(&repo_dir)
.with_context(|| format!("could not create directory {}", repo_dir.display()))?;
// Note: this `info!(...)` macro comes from `log` crate
info!("standard logging redirected to slog");
fs::create_dir(conf.timelines_path(&tenantid))?;
fs::create_dir_all(conf.branches_path(&tenantid))?;
fs::create_dir_all(conf.tags_path(&tenantid))?;
crashsafe_dir::create_dir(conf.timelines_path(&tenantid))?;
crashsafe_dir::create_dir_all(conf.branches_path(&tenantid))?;
crashsafe_dir::create_dir_all(conf.tags_path(&tenantid))?;
info!("created directory structure in {}", repo_dir.display());
let tli = create_timeline(conf, None, &tenantid)?;
// create a new timeline directory
let timeline_id = ZTimelineId::generate();
let timelinedir = conf.timeline_path(&timeline_id, &tenantid);
crashsafe_dir::create_dir(&timelinedir)?;
let repo = Arc::new(crate::layered_repository::LayeredRepository::new(
conf,
wal_redo_manager,
tenantid,
false,
));
// Load data into pageserver
// TODO To implement zenith import we need to
// move data loading out of create_repo()
bootstrap_timeline(conf, tenantid, tli, &*repo)?;
bootstrap_timeline(conf, tenantid, timeline_id, repo.as_ref())?;
Ok(repo)
}
@@ -118,17 +170,20 @@ fn get_lsn_from_controlfile(path: &Path) -> Result<Lsn> {
Ok(Lsn(lsn))
}
// Create the cluster temporarily in a initdbpath directory inside the repository
// Create the cluster temporarily in 'initdbpath' directory inside the repository
// to get bootstrap data for timeline initialization.
//
fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
info!("running initdb... ");
info!("running initdb in {}... ", initdbpath.display());
let initdb_path = conf.pg_bin_dir().join("initdb");
let initdb_output = Command::new(initdb_path)
.args(&["-D", initdbpath.to_str().unwrap()])
.args(&["-U", &conf.superuser])
.arg("--no-instructions")
// This is only used for a temporary installation that is deleted shortly after,
// so no need to fsync it
.arg("--no-sync")
.env_clear()
.env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap())
@@ -141,7 +196,6 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> {
String::from_utf8_lossy(&initdb_output.stderr)
);
}
info!("initdb succeeded");
Ok(())
}
@@ -156,6 +210,8 @@ fn bootstrap_timeline(
tli: ZTimelineId,
repo: &dyn Repository,
) -> Result<()> {
let _enter = info_span!("bootstrapping", timeline = %tli, tenant = %tenantid).entered();
let initdb_path = conf.tenant_path(&tenantid).join("tmp");
// Init temporarily repo to get bootstrap data
@@ -164,13 +220,17 @@ fn bootstrap_timeline(
let lsn = get_lsn_from_controlfile(&pgdata_path)?.align();
info!("bootstrap_timeline {:?} at lsn {}", pgdata_path, lsn);
// Import the contents of the data directory at the initial checkpoint
// LSN, and any WAL after that.
// Initdb lsn will be equal to last_record_lsn which will be set after import.
// Because we know it upfront, we avoid an Option or a dummy zero value by passing it to create_empty_timeline.
let timeline = repo.create_empty_timeline(tli, lsn)?;
restore_local_repo::import_timeline_from_postgres_datadir(&pgdata_path, &*timeline, lsn)?;
let wal_dir = pgdata_path.join("pg_wal");
restore_local_repo::import_timeline_wal(&wal_dir, &*timeline, timeline.get_last_record_lsn())?;
restore_local_repo::import_timeline_from_postgres_datadir(
&pgdata_path,
timeline.writer().as_ref(),
lsn,
)?;
timeline.checkpoint(CheckpointConfig::Forced)?;
println!(
"created initial timeline {} timeline.lsn {}",
@@ -188,65 +248,38 @@ fn bootstrap_timeline(
Ok(())
}
pub(crate) fn get_tenants(conf: &PageServerConf) -> Result<Vec<String>> {
let tenants_dir = conf.tenants_path();
std::fs::read_dir(&tenants_dir)?
.map(|dir_entry_res| {
let dir_entry = dir_entry_res?;
ensure!(dir_entry.file_type()?.is_dir());
Ok(dir_entry.file_name().to_str().unwrap().to_owned())
})
.collect()
}
pub(crate) fn get_branches(conf: &PageServerConf, tenantid: &ZTenantId) -> Result<Vec<BranchInfo>> {
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
pub(crate) fn get_branches(
conf: &PageServerConf,
tenantid: &ZTenantId,
include_non_incremental_logical_size: bool,
) -> Result<Vec<BranchInfo>> {
let repo = tenant_mgr::get_repository_for_tenant(*tenantid)?;
// Each branch has a corresponding record (text file) in the refs/branches
// with timeline_id.
let branches_dir = conf.branches_path(tenantid);
std::fs::read_dir(&branches_dir)?
std::fs::read_dir(&branches_dir)
.with_context(|| {
format!(
"Found no branches directory '{}' for tenant {}",
branches_dir.display(),
tenantid
)
})?
.map(|dir_entry_res| {
let dir_entry = dir_entry_res?;
let name = dir_entry.file_name().to_str().unwrap().to_string();
let timeline_id = std::fs::read_to_string(dir_entry.path())?.parse::<ZTimelineId>()?;
let latest_valid_lsn = repo
.get_timeline(timeline_id)
.map(|timeline| timeline.get_last_record_lsn())
.ok();
let ancestor_path = conf.ancestor_path(&timeline_id, tenantid);
let mut ancestor_id: Option<String> = None;
let mut ancestor_lsn: Option<String> = None;
if ancestor_path.exists() {
let ancestor = std::fs::read_to_string(ancestor_path)?;
let mut strings = ancestor.split('@');
ancestor_id = Some(
strings
.next()
.with_context(|| "wrong branch ancestor point in time format")?
.to_owned(),
);
ancestor_lsn = Some(
strings
.next()
.with_context(|| "wrong branch ancestor point in time format")?
.to_owned(),
);
}
Ok(BranchInfo {
name,
timeline_id,
latest_valid_lsn,
ancestor_id,
ancestor_lsn,
})
let dir_entry = dir_entry_res.with_context(|| {
format!(
"Failed to list branches directory '{}' content for tenant {}",
branches_dir.display(),
tenantid
)
})?;
BranchInfo::from_path(
dir_entry.path(),
&repo,
include_non_incremental_logical_size,
)
})
.collect()
}
@@ -257,7 +290,7 @@ pub(crate) fn create_branch(
startpoint_str: &str,
tenantid: &ZTenantId,
) -> Result<BranchInfo> {
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let repo = tenant_mgr::get_repository_for_tenant(*tenantid)?;
if conf.branch_path(branchname, tenantid).exists() {
anyhow::bail!("branch {} already exists", branchname);
@@ -270,6 +303,14 @@ pub(crate) fn create_branch(
let end_of_wal = timeline.get_last_record_lsn();
info!("branching at end of WAL: {}", end_of_wal);
startpoint.lsn = end_of_wal;
} else {
// Wait for the WAL to arrive and be processed on the parent branch up
// to the requested branch point. The repository code itself doesn't
// require it, but if we start to receive WAL on the new timeline,
// decoding the new WAL might need to look up previous pages, relation
// sizes etc. and that would get confused if the previous page versions
// are not in the repository yet.
timeline.wait_lsn(startpoint.lsn)?;
}
startpoint.lsn = startpoint.lsn.align();
if timeline.get_start_lsn() > startpoint.lsn {
@@ -281,24 +322,26 @@ pub(crate) fn create_branch(
);
}
// create a new timeline directory for it
let newtli = create_timeline(conf, Some(startpoint), tenantid)?;
let new_timeline_id = ZTimelineId::generate();
// Let the Repository backend do its initialization
repo.branch_timeline(startpoint.timelineid, newtli, startpoint.lsn)?;
// Forward entire timeline creation routine to repository
// backend, so it can do all needed initialization
repo.branch_timeline(startpoint.timelineid, new_timeline_id, startpoint.lsn)?;
// Remember the human-readable branch name for the new timeline.
// FIXME: there's a race condition, if you create a branch with the same
// name concurrently.
let data = newtli.to_string();
let data = new_timeline_id.to_string();
fs::write(conf.branch_path(branchname, tenantid), data)?;
Ok(BranchInfo {
name: branchname.to_string(),
timeline_id: newtli,
latest_valid_lsn: Some(startpoint.lsn),
ancestor_id: None,
ancestor_lsn: None,
timeline_id: new_timeline_id,
latest_valid_lsn: startpoint.lsn,
ancestor_id: Some(startpoint.timelineid.to_string()),
ancestor_lsn: Some(startpoint.lsn.to_string()),
current_logical_size: 0,
current_logical_size_non_incremental: Some(0),
})
}
@@ -374,25 +417,3 @@ fn parse_point_in_time(
bail!("could not parse point-in-time {}", s);
}
fn create_timeline(
conf: &PageServerConf,
ancestor: Option<PointInTime>,
tenantid: &ZTenantId,
) -> Result<ZTimelineId> {
// Create initial timeline
let timelineid = ZTimelineId::generate();
let timelinedir = conf.timeline_path(&timelineid, tenantid);
fs::create_dir(&timelinedir)?;
fs::create_dir(&timelinedir.join("wal"))?;
if let Some(ancestor) = ancestor {
let data = format!("{}@{}", ancestor.timelineid, ancestor.lsn);
fs::write(timelinedir.join("ancestor"), data)?;
}
Ok(timelineid)
}

View File

@@ -25,6 +25,11 @@ paths:
schema:
type: string
format: hex
- name: include-non-incremental-logical-size
in: query
schema:
type: string
description: Controls calculation of current_logical_size_non_incremental
get:
description: Get branches for tenant
responses:
@@ -54,7 +59,57 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/branch/{tenant_id}/{branch_name}:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: branch_name
in: path
required: true
schema:
type: string
- name: include-non-incremental-logical-size
in: query
schema:
type: string
description: Controls calculation of current_logical_size_non_incremental
get:
description: Get a single branch of the tenant by name
responses:
"200":
description: BranchInfo
content:
application/json:
schema:
$ref: "#/components/schemas/BranchInfo"
"400":
description: Error when no tenant id found in path or no branch name
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
@@ -119,13 +174,13 @@ paths:
description: Get tenants list
responses:
"200":
description: OK
description: TenantInfo
content:
application/json:
schema:
type: array
items:
type: string
$ref: "#/components/schemas/TenantInfo"
"401":
description: Unauthorized Error
content:
@@ -198,11 +253,23 @@ components:
scheme: bearer
bearerFormat: JWT
schemas:
TenantInfo:
type: object
required:
- id
- state
properties:
id:
type: string
state:
type: string
BranchInfo:
type: object
required:
- name
- timeline_id
- latest_valid_lsn
- current_logical_size
properties:
name:
type: string
@@ -213,6 +280,10 @@ components:
type: string
ancestor_lsn:
type: string
current_logical_size:
type: integer
current_logical_size_non_incremental:
type: integer
Error:
type: object
required:

View File

@@ -5,6 +5,7 @@ use hyper::header;
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use routerify::{ext::RequestExt, RouterBuilder};
use tracing::*;
use zenith_utils::auth::JwtAuth;
use zenith_utils::http::endpoint::attach_openapi_ui;
use zenith_utils::http::endpoint::auth_middleware;
@@ -14,14 +15,14 @@ use zenith_utils::http::{
endpoint,
error::HttpErrorBody,
json::{json_request, json_response},
request::get_request_param,
request::parse_request_param,
};
use super::models::BranchCreateRequest;
use super::models::TenantCreateRequest;
use crate::{
branches::{self},
tenant_mgr, PageServerConf, ZTenantId,
};
use crate::branches::BranchInfo;
use crate::{branches, tenant_mgr, PageServerConf, ZTenantId};
#[derive(Debug)]
struct State {
@@ -72,6 +73,7 @@ async fn branch_create_handler(mut request: Request<Body>) -> Result<Response<Bo
check_permission(&request, Some(request_data.tenant_id))?;
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("/branch_create", name = %request_data.name, tenant = %request_data.tenant_id, startpoint=%request_data.start_point).entered();
branches::create_branch(
get_config(&request),
&request_data.name,
@@ -84,36 +86,71 @@ async fn branch_create_handler(mut request: Request<Body>) -> Result<Response<Bo
Ok(json_response(StatusCode::CREATED, response_data)?)
}
// Gate non-incremental logical size calculation behind a flag:
// after pgbench -i -s100 the calculation took 28ms, so multiplied by the number of timelines
// and tenants it can take a noticeable amount of time. The value is also currently used only in tests
fn get_include_non_incremental_logical_size(request: &Request<Body>) -> bool {
request
.uri()
.query()
.map(|v| {
url::form_urlencoded::parse(v.as_bytes())
.into_owned()
.any(|(param, _)| param == "include-non-incremental-logical-size")
})
.unwrap_or(false)
}
async fn branch_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenantid: ZTenantId = match request.param("tenant_id") {
Some(arg) => arg
.parse()
.map_err(|_| ApiError::BadRequest("failed to parse tenant id".to_string()))?,
None => {
return Err(ApiError::BadRequest(
"no tenant id specified in path param".to_string(),
))
}
};
let tenantid: ZTenantId = parse_request_param(&request, "tenant_id")?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
check_permission(&request, Some(tenantid))?;
let response_data = tokio::task::spawn_blocking(move || {
crate::branches::get_branches(get_config(&request), &tenantid)
let _enter = info_span!("branch_list", tenant = %tenantid).entered();
crate::branches::get_branches(
get_config(&request),
&tenantid,
include_non_incremental_logical_size,
)
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
async fn branch_detail_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenantid: ZTenantId = parse_request_param(&request, "tenant_id")?;
let branch_name: String = get_request_param(&request, "branch_name")?.to_string();
let conf = get_state(&request).conf;
let path = conf.branch_path(&branch_name, &tenantid);
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("branch_detail", tenant = %tenantid, branch=%branch_name).entered();
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
BranchInfo::from_path(path, &repo, include_non_incremental_logical_size)
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
// check for management permission
check_permission(&request, None)?;
let response_data =
tokio::task::spawn_blocking(move || crate::branches::get_tenants(get_config(&request)))
.await
.map_err(ApiError::from_err)??;
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_list").entered();
crate::tenant_mgr::list_tenants()
})
.await
.map_err(ApiError::from_err)??;
Ok(json_response(StatusCode::OK, response_data)?)
}
@@ -124,6 +161,7 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
let request_data: TenantCreateRequest = json_request(&mut request).await?;
let response_data = tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_create", tenant = %request_data.tenant_id).entered();
tenant_mgr::create_repository_for_tenant(get_config(&request), request_data.tenant_id)
})
.await
@@ -159,6 +197,7 @@ pub fn make_router(
.data(Arc::new(State::new(conf, auth)))
.get("/v1/status", status_handler)
.get("/v1/branch/:tenant_id", branch_list_handler)
.get("/v1/branch/:tenant_id/:branch_name", branch_detail_handler)
.post("/v1/branch", branch_create_handler)
.get("/v1/tenant", tenant_list_handler)
.post("/v1/tenant", tenant_create_handler)

File diff suppressed because it is too large

View File

@@ -1,10 +1,131 @@
# Overview
The on-disk format is based on immutable files. The page server
receives a stream of incoming WAL, parses the WAL records to determine
which pages they apply to, and accumulates the incoming changes in
memory. Every now and then, the accumulated changes are written out to
new files.
The on-disk format is based on immutable files. The page server receives a
stream of incoming WAL, parses the WAL records to determine which pages they
apply to, and accumulates the incoming changes in memory. Every now and then,
the accumulated changes are written out to new immutable files. This process is
called checkpointing. Old versions of on-disk files that are not needed by any
timeline are removed by the GC process.
The main responsibility of the Page Server is to process the incoming WAL, and
reprocess it into a format that allows reasonably quick access to any page
version.
The incoming WAL contains updates to arbitrary pages in the system. The
distribution depends on the workload: the updates could be totally random, or
there could be a long stream of updates to a single relation when data is bulk
loaded, for example, or something in between. The page server slices the
incoming WAL per relation and page, and packages the sliced WAL into
suitably-sized "layer files". The layer files contain all the history of the
database, back to some reasonable retention period. This system replaces the
base backups and the WAL archive used in a traditional PostgreSQL
installation. The layer files are immutable; they are not modified in place
after creation. New layer files are created for new incoming WAL, and old layer
files are removed when they are no longer needed. We could also replace layer
files with new files that contain the same information, merging small files for
example, but that hasn't been implemented yet.
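A minimal sketch of the slicing idea, using hypothetical names (`SegTag`, `ReorderBuffer`, `buf_wal_record`) that do not match the pageserver's real types:

```rust
use std::collections::BTreeMap;

/// Hypothetical identifier for one slice ("segment") of a relish.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SegTag {
    rel_id: u32,
    seg_no: u32,
}

#[derive(Default)]
struct ReorderBuffer {
    // WAL payloads keyed by (segment, block number, LSN), so records that
    // touch the same relation and page end up next to each other.
    entries: BTreeMap<(SegTag, u32, u64), Vec<u8>>,
}

impl ReorderBuffer {
    fn buf_wal_record(&mut self, seg: SegTag, blknum: u32, lsn: u64, rec: Vec<u8>) {
        self.entries.insert((seg, blknum, lsn), rec);
    }

    // Drain everything buffered for one segment, e.g. when it has accumulated
    // enough WAL to be written out as a new layer file.
    fn drain_segment(&mut self, seg: SegTag) -> Vec<((u32, u64), Vec<u8>)> {
        let keys: Vec<(SegTag, u32, u64)> = self
            .entries
            .range((seg, 0u32, 0u64)..=(seg, u32::MAX, u64::MAX))
            .map(|(k, _)| *k)
            .collect();
        keys.into_iter()
            .map(|k| ((k.1, k.2), self.entries.remove(&k).unwrap()))
            .collect()
    }
}
```

The real code keeps a separate open in-memory layer per segment rather than one global map, but the grouping by segment, block and LSN is the essential point.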
Cloud Storage             Page Server                           Safekeeper
                    Local disk              Memory                  WAL

|AAAA|              |AAAA|AAAA|             |AA
|BBBB|              |BBBB|BBBB|             |
|CCCC|CCCC|  <----  |CCCC|CCCC|CCCC|  <---  |CC    <----   ADEBAABED
|DDDD|DDDD|         |DDDD|DDDD|             |DDD
|EEEE|              |EEEE|EEEE|EEEE|        |E
In this illustration, WAL is received as a stream from the Safekeeper, from the
right. It is immediately captured by the page server and stored quickly in
memory. The page server memory can be thought of as a quick "reorder buffer",
used to hold the incoming WAL and reorder it so that we keep the WAL records for
the same page and relation close to each other.
From the page server memory, whenever enough WAL has been accumulated for one
relation segment, it is moved to local disk, as a new layer file, and the memory
is released.
From the local disk, the layers are further copied to Cloud Storage, for
long-term archival. After a layer has been copied to Cloud Storage, it can be
removed from local disk, although we currently keep everything locally for fast
access. If a layer is needed that isn't found locally, it is fetched from Cloud
Storage and stored in local disk.
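The local-vs-cloud lookup described above can be summarized with a small hypothetical helper; the function and parameter names are illustrative, not the pageserver's actual API:

```rust
use std::path::{Path, PathBuf};

/// Use the local copy of a layer file if it exists, otherwise download it
/// from cloud storage into the local directory so later accesses are fast.
fn ensure_layer_is_local(
    local_dir: &Path,
    layer_file_name: &str,
    download_from_cloud: impl Fn(&str, &Path) -> std::io::Result<()>,
) -> std::io::Result<PathBuf> {
    let local_path = local_dir.join(layer_file_name);
    if !local_path.exists() {
        // Not on local disk: fetch it from long-term storage first.
        download_from_cloud(layer_file_name, &local_path)?;
    }
    Ok(local_path)
}
```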
# Terms used in layered repository
- Relish - one PostgreSQL relation or similarly treated file.
- Segment - one slice of a Relish that is stored in a LayeredTimeline.
- Layer - specific version of a relish Segment in a range of LSNs.
# Layer map
The LayerMap tracks what layers exist for all the relishes in a timeline.
LayerMap consists of two data structures:
- segs - All the layers keyed by segment tag
- open_layers - data structure that holds all open layers ordered by oldest_pending_lsn for quick access during checkpointing. oldest_pending_lsn is the LSN of the oldest page version stored in the layer.
All operations that update InMemory Layers should update both structures to keep them up-to-date.
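A hedged sketch of those two structures and of the invariant that they are updated together; the types and names below are stand-ins, not the real LayerMap:

```rust
use std::cmp::Reverse;
use std::collections::{BTreeMap, BinaryHeap};

// Hypothetical stand-ins for the pageserver's own types.
type SegmentTag = (u32, u32); // (relish id, segment number)
type Lsn = u64;

#[derive(Default)]
struct LayerMapSketch {
    // `segs`: which layers exist for each segment (here just file names).
    segs: BTreeMap<SegmentTag, Vec<String>>,
    // `open_layers`: open layers ordered so the one with the oldest pending
    // LSN comes out first, which is what checkpointing wants.
    open_layers: BinaryHeap<Reverse<(Lsn, SegmentTag)>>,
}

impl LayerMapSketch {
    fn insert_open(&mut self, seg: SegmentTag, oldest_pending_lsn: Lsn, name: String) {
        // Both structures must be updated together to stay consistent.
        self.segs.entry(seg).or_default().push(name);
        self.open_layers.push(Reverse((oldest_pending_lsn, seg)));
    }

    fn next_layer_to_checkpoint(&mut self) -> Option<SegmentTag> {
        self.open_layers.pop().map(|Reverse((_lsn, seg))| seg)
    }
}
```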
- LayeredTimeline - implements the Timeline interface.
All methods of LayeredTimeline are aware of its ancestors and return data taking them into account.
TODO: Are there any exceptions to this?
For example, timeline.list_rels(lsn) will return all segments that are visible in this timeline at the LSN,
including ones that were not modified in this timeline and thus don't have a layer in the timeline's LayerMap.
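A hypothetical sketch of that ancestor-aware behaviour for a `list_rels`-style query; the types are illustrative only:

```rust
struct TimelineSketch<'a> {
    // (ancestor timeline, LSN at which this timeline was branched off)
    ancestor: Option<(&'a TimelineSketch<'a>, u64)>,
    local_segments: Vec<u32>,
}

impl<'a> TimelineSketch<'a> {
    fn list_rels(&self, lsn: u64) -> Vec<u32> {
        let mut rels = self.local_segments.clone();
        if let Some((parent, ancestor_lsn)) = self.ancestor {
            // Segments untouched on this timeline still exist; they live in
            // the ancestor's layer map, up to the branch point.
            rels.extend(parent.list_rels(lsn.min(ancestor_lsn)));
        }
        rels.sort_unstable();
        rels.dedup();
        rels
    }
}
```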
# Different kinds of layers
A layer can be in different states:
- Open - a layer to which new WAL records can still be appended.
- Closed - a layer that is read-only; no new WAL records can be appended to it.
- Historic: synonym for closed
- InMemory: A layer that needs to be rebuilt from WAL on pageserver start.
To avoid OOM errors, InMemory layers can be spilled to disk into an ephemeral file.
- OnDisk: A layer that is stored on disk. If its end-LSN is older than
disk_consistent_lsn, it is known to be fully flushed and fsync'd to local disk.
- Frozen layer: an in-memory layer that is Closed.
TODO: Clarify the difference between Closed, Historic and Frozen.
There are two kinds of OnDisk layers:
- ImageLayer represents an image or a snapshot of a 10 MB relish segment, at one particular LSN.
- DeltaLayer represents a collection of WAL records or page images in a range of LSNs, for one
relish segment.
Dropped segments are always represented on disk by DeltaLayer.
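As a rough illustration only (not the actual pageserver type definitions), the kinds of layers above could be summarized as:

```rust
// Hedged sketch of the layer kinds; fields are simplified placeholders.
struct InMemorySketch {
    seg: u32,
    start_lsn: u64,
    writable: bool, // Open if true, Closed/Frozen if false
}

enum OnDiskSketch {
    // Image layer: a snapshot of one segment at a single LSN.
    Image { seg: u32, lsn: u64 },
    // Delta layer: WAL records / page images for one segment in an LSN range.
    // Dropped segments are always represented by a delta layer.
    Delta { seg: u32, start_lsn: u64, end_lsn: u64, dropped: bool },
}
```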
# Layer life cycle
LSN range defined by start_lsn and end_lsn:
- start_lsn is inclusive.
- end_lsn is exclusive.
For an open in-memory layer, the end_lsn is MAX_LSN. For a frozen in-memory
layer or a delta layer, it is a valid end bound. An image layer represents a
snapshot at one LSN, so end_lsn is always the snapshot LSN + 1.
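A tiny sketch of this convention, using plain `u64` LSNs for brevity:

```rust
// start_lsn is inclusive, end_lsn is exclusive; an open in-memory layer
// effectively uses the maximum LSN as its end bound.
fn layer_covers_lsn(start_lsn: u64, end_lsn: u64, lsn: u64) -> bool {
    start_lsn <= lsn && lsn < end_lsn
}

fn image_layer_lsn_range(snapshot_lsn: u64) -> (u64, u64) {
    // An image layer is a snapshot at exactly one LSN: [lsn, lsn + 1).
    (snapshot_lsn, snapshot_lsn + 1)
}
```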
Every layer starts its life as an Open In-Memory layer. When the page server
receives the first WAL record for a segment, it creates a new In-Memory layer
for it, and puts it into the layer map. Later, when the layer is old enough, its
contents are written to disk as On-Disk layers. This process is called
"evicting" a layer.
Layer eviction is a two-step process: First, the layer is marked as closed, so
that it no longer accepts new WAL records, and the layer map is updated
accordingly. If a new WAL record for that segment arrives after this step, a new
Open layer is created to hold it. After this first step, the layer is in the Closed
InMemory state. This first step is called "freezing" the layer.
In the second step, new Delta and Image layers are created, containing all the
data in the Frozen InMemory layer. When the new layers are ready, the original
frozen layer is replaced with the new layers in the layer map, and the original
frozen layer is dropped, releasing the memory.
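A self-contained, hedged sketch of the two steps (freeze, then write out), with deliberately simplified types:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct EvictionSketch {
    open: HashMap<u32, Vec<u8>>,   // segment -> buffered WAL (open layer)
    frozen: HashMap<u32, Vec<u8>>, // segment -> closed in-memory layer
    on_disk: HashMap<u32, String>, // segment -> on-disk layer file name
}

impl EvictionSketch {
    // Step 1: "freeze" the open layer. It stops accepting WAL; a new record
    // for this segment would go into a fresh open layer created on demand.
    fn freeze(&mut self, seg: u32) {
        if let Some(buf) = self.open.remove(&seg) {
            self.frozen.insert(seg, buf);
        }
    }

    // Step 2: write the frozen contents out as on-disk layers, swap them into
    // the map, and drop the frozen layer to release the memory.
    fn write_out(&mut self, seg: u32) {
        if let Some(buf) = self.frozen.remove(&seg) {
            let file_name = format!("delta-layer-for-seg-{}-{}-bytes", seg, buf.len());
            self.on_disk.insert(seg, file_name);
            // `buf` is dropped here, releasing the memory.
        }
    }
}
```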
# Layer files (On-disk layers)
The files are called "layer files". Each layer file corresponds
to one RELISH_SEG_SIZE slice of a PostgreSQL relation fork or
@@ -313,6 +434,8 @@ is a newer layer file there. TODO: This optimization hasn't been
implemented! The GC algorithm will currently keep the file on the
'main' branch anyway, for as long as the child branch exists.
TODO:
Describe GC and checkpoint interval settings.
# TODO: On LSN ranges

View File

@@ -1,4 +1,5 @@
use std::{fs::File, io::Write};
use std::io::{Read, Write};
use std::os::unix::prelude::FileExt;
use anyhow::Result;
use bookfile::{BookWriter, BoundedReader, ChapterId, ChapterWriter};
@@ -10,36 +11,36 @@ pub struct BlobRange {
size: usize,
}
pub fn read_blob(reader: &BoundedReader<&'_ File>, range: &BlobRange) -> Result<Vec<u8>> {
pub fn read_blob<F: FileExt>(reader: &BoundedReader<&'_ F>, range: &BlobRange) -> Result<Vec<u8>> {
let mut buf = vec![0u8; range.size];
reader.read_exact_at(&mut buf, range.offset)?;
Ok(buf)
}
pub struct BlobWriter {
writer: ChapterWriter<File>,
pub struct BlobWriter<W> {
writer: ChapterWriter<W>,
offset: u64,
}
impl BlobWriter {
impl<W: Write> BlobWriter<W> {
// This function takes a BookWriter and creates a new chapter to ensure offset is 0.
pub fn new(book_writer: BookWriter<File>, chapter_id: impl Into<ChapterId>) -> Self {
pub fn new(book_writer: BookWriter<W>, chapter_id: impl Into<ChapterId>) -> Self {
let writer = book_writer.new_chapter(chapter_id);
Self { writer, offset: 0 }
}
pub fn write_blob(&mut self, blob: &[u8]) -> Result<BlobRange> {
self.writer.write_all(blob)?;
pub fn write_blob_from_reader(&mut self, r: &mut impl Read) -> Result<BlobRange> {
let len = std::io::copy(r, &mut self.writer)?;
let range = BlobRange {
offset: self.offset,
size: blob.len(),
size: len as usize,
};
self.offset += blob.len() as u64;
self.offset += len as u64;
Ok(range)
}
pub fn close(self) -> bookfile::Result<BookWriter<File>> {
pub fn close(self) -> bookfile::Result<BookWriter<W>> {
self.writer.close()
}
}
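A hedged usage sketch of the generic BlobWriter above; the helper name and chapter id are illustrative, and it assumes crate-internal visibility of BlobRange plus conversion of bookfile errors into anyhow::Error.
// Hypothetical helper: stream one blob into a dedicated chapter and return
// the updated BookWriter together with the BlobRange needed to read it back
// later via read_blob().
fn write_one_blob<W: std::io::Write>(
    book_writer: bookfile::BookWriter<W>,
    chapter_id: u64,
    data: &mut impl std::io::Read,
) -> anyhow::Result<(bookfile::BookWriter<W>, BlobRange)> {
    let mut blob_writer = BlobWriter::new(book_writer, chapter_id);
    let range = blob_writer.write_blob_from_reader(data)?;
    let book_writer = blob_writer.close()?;
    Ok((book_writer, range))
}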

View File

@@ -39,27 +39,26 @@
//!
use crate::layered_repository::blob::BlobWriter;
use crate::layered_repository::filename::{DeltaFileName, PathOrConf};
use crate::layered_repository::page_versions::PageVersions;
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentTag,
};
use crate::repository::WALRecord;
use crate::virtual_file::VirtualFile;
use crate::waldecoder;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, Result};
use bytes::Bytes;
use anyhow::{bail, ensure, Result};
use log::*;
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
use zenith_utils::vec_map::VecMap;
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::fs;
use std::fs::File;
use std::io::Write;
use std::io::{BufWriter, Write};
use std::ops::Bound::Included;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, MutexGuard};
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
@@ -69,7 +68,7 @@ use zenith_utils::lsn::Lsn;
use super::blob::{read_blob, BlobRange};
// Magic constant to identify a Zenith delta file
static DELTA_FILE_MAGIC: u32 = 0x5A616E01;
pub const DELTA_FILE_MAGIC: u32 = 0x5A616E01;
/// Mapping from (block #, lsn) -> page/WAL record
/// byte ranges in PAGE_VERSIONS_CHAPTER
@@ -79,10 +78,34 @@ static PAGE_VERSION_METAS_CHAPTER: u64 = 1;
static PAGE_VERSIONS_CHAPTER: u64 = 2;
static REL_SIZES_CHAPTER: u64 = 3;
#[derive(Serialize, Deserialize)]
struct PageVersionMeta {
page_image_range: Option<BlobRange>,
record_range: Option<BlobRange>,
/// Contains the [`Summary`] struct
static SUMMARY_CHAPTER: u64 = 4;
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
struct Summary {
tenantid: ZTenantId,
timelineid: ZTimelineId,
seg: SegmentTag,
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
}
impl From<&DeltaLayer> for Summary {
fn from(layer: &DeltaLayer) -> Self {
Self {
tenantid: layer.tenantid,
timelineid: layer.timelineid,
seg: layer.seg,
start_lsn: layer.start_lsn,
end_lsn: layer.end_lsn,
dropped: layer.dropped,
}
}
}
///
@@ -109,9 +132,6 @@ pub struct DeltaLayer {
dropped: bool,
/// Predecessor layer
predecessor: Option<Arc<dyn Layer>>,
inner: Mutex<DeltaLayerInner>,
}
@@ -120,15 +140,21 @@ pub struct DeltaLayerInner {
/// loaded into memory yet.
loaded: bool,
book: Option<Book<VirtualFile>>,
/// All versions of all pages in the file are are kept here.
/// Indexed by block number and LSN.
page_version_metas: BTreeMap<(u32, Lsn), PageVersionMeta>,
page_version_metas: VecMap<(u32, Lsn), BlobRange>,
/// `relsizes` tracks the size of the relation at different points in time.
relsizes: BTreeMap<Lsn, u32>,
relsizes: VecMap<Lsn, u32>,
}
impl Layer for DeltaLayer {
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
fn get_timeline_id(&self) -> ZTimelineId {
self.timelineid
}
@@ -150,15 +176,7 @@ impl Layer for DeltaLayer {
}
fn filename(&self) -> PathBuf {
PathBuf::from(
DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
}
.to_string(),
)
PathBuf::from(self.layer_name().to_string())
}
/// Look up given page in the cache.
@@ -166,47 +184,63 @@ impl Layer for DeltaLayer {
&self,
blknum: u32,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
let mut cont_lsn: Option<Lsn> = Some(lsn);
let mut need_image = true;
assert!(self.seg.blknum_in_seg(blknum));
match &cached_img_lsn {
Some(cached_lsn) if &self.end_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Cached)
}
_ => {}
}
{
// Open the file and lock the metadata in memory
// TODO: avoid opening the file for each read
let (_path, book) = self.open_book()?;
let page_version_reader = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
let inner = self.load()?;
let page_version_reader = inner
.book
.as_ref()
.expect("should be loaded in load call above")
.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
// Scan the metadata BTreeMap backwards, starting from the given entry.
let minkey = (blknum, Lsn(0));
let maxkey = (blknum, lsn);
let mut iter = inner
let iter = inner
.page_version_metas
.range((Included(&minkey), Included(&maxkey)));
while let Some(((_blknum, entry_lsn), entry)) = iter.next_back() {
if let Some(img_range) = &entry.page_image_range {
// Found a page image, return it
let img = Bytes::from(read_blob(&page_version_reader, img_range)?);
reconstruct_data.page_img = Some(img);
cont_lsn = None;
break;
} else if let Some(rec_range) = &entry.record_range {
let rec = WALRecord::des(&read_blob(&page_version_reader, rec_range)?)?;
let will_init = rec.will_init;
reconstruct_data.records.push(rec);
if will_init {
// This WAL record initializes the page, so no need to go further back
cont_lsn = None;
break;
} else {
// This WAL record needs to be applied against an older page image
cont_lsn = Some(*entry_lsn);
.slice_range((Included(&minkey), Included(&maxkey)))
.iter()
.rev();
for ((_blknum, pv_lsn), blob_range) in iter {
match &cached_img_lsn {
Some(cached_lsn) if pv_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Cached)
}
_ => {}
}
let pv = PageVersion::des(&read_blob(&page_version_reader, blob_range)?)?;
match pv {
PageVersion::Page(img) => {
// Found a page image, return it
reconstruct_data.page_img = Some(img);
need_image = false;
break;
}
PageVersion::Wal(rec) => {
let will_init = rec.will_init;
reconstruct_data.records.push((*pv_lsn, rec));
if will_init {
// This WAL record initializes the page, so no need to go further back
need_image = false;
break;
}
}
} else {
// No base image, and no WAL record. Huh?
bail!("no page image or WAL record for requested page");
}
}
@@ -214,16 +248,9 @@ impl Layer for DeltaLayer {
}
// If an older page image is needed to reconstruct the page, let the
// caller know about the predecessor layer.
if let Some(cont_lsn) = cont_lsn {
if let Some(cont_layer) = &self.predecessor {
Ok(PageReconstructResult::Continue(
cont_lsn,
Arc::clone(cont_layer),
))
} else {
Ok(PageReconstructResult::Missing(cont_lsn))
}
// caller know.
if need_image {
Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1)))
} else {
Ok(PageReconstructResult::Complete)
}
@@ -232,21 +259,22 @@ impl Layer for DeltaLayer {
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<u32> {
assert!(lsn >= self.start_lsn);
ensure!(
self.seg.rel.is_blocky(),
"get_seg_size() called on a non-blocky rel"
);
// Scan the BTreeMap backwards, starting from the given entry.
let inner = self.load()?;
let mut iter = inner.relsizes.range((Included(&Lsn(0)), Included(&lsn)));
let slice = inner
.relsizes
.slice_range((Included(&Lsn(0)), Included(&lsn)));
let result;
if let Some((_entry_lsn, entry)) = iter.next_back() {
result = *entry;
// Use the base image if needed
} else if let Some(predecessor) = &self.predecessor {
result = predecessor.get_seg_size(lsn)?;
if let Some((_entry_lsn, entry)) = slice.last() {
Ok(*entry)
} else {
result = 0;
Err(anyhow::anyhow!("could not find seg size in delta layer"))
}
Ok(result)
}
/// Does this segment exist at given LSN?
@@ -266,9 +294,14 @@ impl Layer for DeltaLayer {
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.page_version_metas = BTreeMap::new();
inner.relsizes = BTreeMap::new();
inner.page_version_metas = VecMap::default();
inner.relsizes = VecMap::default();
inner.loaded = false;
// Note: we keep the Book open. Is that a good idea? The virtual file
// machinery has its own rules for closing the file descriptor if it's not
// needed, but the Book struct uses up some memory, too.
Ok(())
}
@@ -282,41 +315,52 @@ impl Layer for DeltaLayer {
true
}
fn is_in_memory(&self) -> bool {
false
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
println!(
"----- delta layer for tli {} seg {} {}-{} ----",
self.timelineid, self.seg, self.start_lsn, self.end_lsn
"----- delta layer for ten {} tli {} seg {} {}-{} ----",
self.tenantid, self.timelineid, self.seg, self.start_lsn, self.end_lsn
);
println!("--- relsizes ---");
let inner = self.load()?;
for (k, v) in inner.relsizes.iter() {
for (k, v) in inner.relsizes.as_slice() {
println!(" {}: {}", k, v);
}
println!("--- page versions ---");
let (_path, book) = self.open_book()?;
let path = self.path();
let file = std::fs::File::open(&path)?;
let book = Book::new(file)?;
let chapter = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?;
for (k, v) in inner.page_version_metas.iter() {
for ((blk, lsn), blob_range) in inner.page_version_metas.as_slice() {
let mut desc = String::new();
if let Some(page_image_range) = v.page_image_range.as_ref() {
let image = read_blob(&chapter, &page_image_range)?;
write!(&mut desc, " img {} bytes", image.len())?;
let buf = read_blob(&chapter, blob_range)?;
let pv = PageVersion::des(&buf)?;
match pv {
PageVersion::Page(img) => {
write!(&mut desc, " img {} bytes", img.len())?;
}
PageVersion::Wal(rec) => {
let wal_desc = waldecoder::describe_wal_record(&rec.rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
rec.rec.len(),
rec.will_init,
wal_desc
)?;
}
}
if let Some(record_range) = v.record_range.as_ref() {
let record_bytes = read_blob(&chapter, record_range)?;
let rec = WALRecord::des(&record_bytes)?;
let wal_desc = waldecoder::describe_wal_record(&rec.rec);
write!(
&mut desc,
" rec {} bytes will_init: {} {}",
rec.rec.len(),
rec.will_init,
wal_desc
)?;
}
println!(" blk {} at {}: {}", k.0, k.1, desc);
println!(" blk {} at {}: {}", blk, lsn, desc);
}
Ok(())
@@ -324,20 +368,6 @@ impl Layer for DeltaLayer {
}
impl DeltaLayer {
fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
},
)
}
fn path_for(
path_or_conf: &PathOrConf,
timelineid: ZTimelineId,
@@ -352,12 +382,13 @@ impl DeltaLayer {
}
}
/// Create a new delta file, using the given btreemaps containing the page versions and
/// relsizes.
/// Create a new delta file, using the given page versions and relsizes.
/// The page versions are passed in a PageVersions struct. If 'cutoff' is
/// given, only page versions with LSN < cutoff are included.
///
/// This is used to write the in-memory layer to disk. The in-memory layer uses the same
/// data structure with two btreemaps as we do, so passing the btreemaps is currently
/// expedient.
/// This is used to write the in-memory layer to disk. The page_versions and
/// relsizes are thus passed in the same format as they are in the in-memory
/// layer, as that's expedient.
#[allow(clippy::too_many_arguments)]
pub fn create(
conf: &'static PageServerConf,
@@ -367,10 +398,14 @@ impl DeltaLayer {
start_lsn: Lsn,
end_lsn: Lsn,
dropped: bool,
predecessor: Option<Arc<dyn Layer>>,
page_versions: BTreeMap<(u32, Lsn), PageVersion>,
relsizes: BTreeMap<Lsn, u32>,
page_versions: &PageVersions,
cutoff: Option<Lsn>,
relsizes: VecMap<Lsn, u32>,
) -> Result<DeltaLayer> {
if seg.rel.is_blocky() {
assert!(!relsizes.is_empty());
}
let delta_layer = DeltaLayer {
path_or_conf: PathOrConf::Conf(conf),
timelineid,
@@ -380,64 +415,71 @@ impl DeltaLayer {
end_lsn,
dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: true,
page_version_metas: BTreeMap::new(),
loaded: false,
book: None,
page_version_metas: VecMap::default(),
relsizes,
}),
predecessor,
};
let mut inner = delta_layer.inner.lock().unwrap();
// Write the in-memory btreemaps into a file
let path = delta_layer.path();
// Write the data into a file
//
// Note: Because we open the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.book here. The first read will have to re-open it.
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let book = BookWriter::new(file, DELTA_FILE_MAGIC)?;
let path = delta_layer.path();
let file = VirtualFile::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, DELTA_FILE_MAGIC)?;
let mut page_version_writer = BlobWriter::new(book, PAGE_VERSIONS_CHAPTER);
for (key, page_version) in page_versions {
let page_image_range = page_version
.page_image
.map(|page_image| page_version_writer.write_blob(page_image.as_ref()))
.transpose()?;
let page_versions_iter = page_versions.ordered_page_version_iter(cutoff);
for (blknum, lsn, pos) in page_versions_iter {
let blob_range =
page_version_writer.write_blob_from_reader(&mut page_versions.reader(pos)?)?;
let record_range = page_version
.record
.map(|record| {
let buf = WALRecord::ser(&record)?;
page_version_writer.write_blob(&buf)
})
.transpose()?;
let old = inner.page_version_metas.insert(
key,
PageVersionMeta {
page_image_range,
record_range,
},
);
assert!(old.is_none());
inner
.page_version_metas
.append((blknum, lsn), blob_range)
.unwrap();
}
let book = page_version_writer.close()?;
// Write out page versions
let mut chapter = book.new_chapter(PAGE_VERSION_METAS_CHAPTER);
let buf = BTreeMap::ser(&inner.page_version_metas)?;
let buf = VecMap::ser(&inner.page_version_metas)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
// and relsizes to separate chapter
let mut chapter = book.new_chapter(REL_SIZES_CHAPTER);
let buf = BTreeMap::ser(&inner.relsizes)?;
let buf = VecMap::ser(&inner.relsizes)?;
chapter.write_all(&buf)?;
let book = chapter.close()?;
book.close()?;
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid,
timelineid,
seg,
start_lsn,
end_lsn,
dropped,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
let writer = book.close()?;
writer.get_ref().sync_all()?;
trace!("saved {}", &path.display());
@@ -446,25 +488,6 @@ impl DeltaLayer {
Ok(delta_layer)
}
fn open_book(&self) -> Result<(PathBuf, Book<File>)> {
let path = Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
},
);
let file = File::open(&path)?;
let book = Book::new(file)?;
Ok((path, book))
}
///
/// Load the contents of the file into memory
///
@@ -476,21 +499,51 @@ impl DeltaLayer {
return Ok(inner);
}
let (path, book) = self.open_book()?;
let path = self.path();
// Open the file if it's not open already.
if inner.book.is_none() {
let file = VirtualFile::open(&path)?;
inner.book = Some(Book::new(file)?);
}
let book = inner.book.as_ref().unwrap();
match &self.path_or_conf {
PathOrConf::Conf(_) => {
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let actual_summary = Summary::des(&chapter)?;
let expected_summary = Summary::from(self);
if actual_summary != expected_summary {
bail!("in-file summary does not match expected summary. actual = {:?} expected = {:?}", actual_summary, expected_summary);
}
}
PathOrConf::Path(path) => {
let actual_filename = Path::new(path.file_name().unwrap());
let expected_filename = self.filename();
if actual_filename != expected_filename {
println!(
"warning: filename does not match what is expected from in-file summary"
);
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
}
}
}
let chapter = book.read_chapter(PAGE_VERSION_METAS_CHAPTER)?;
let page_version_metas = BTreeMap::des(&chapter)?;
let page_version_metas = VecMap::des(&chapter)?;
let chapter = book.read_chapter(REL_SIZES_CHAPTER)?;
let relsizes = BTreeMap::des(&chapter)?;
let relsizes = VecMap::des(&chapter)?;
debug!("loaded from {}", &path.display());
*inner = DeltaLayerInner {
loaded: true,
page_version_metas,
relsizes,
};
inner.page_version_metas = page_version_metas;
inner.relsizes = relsizes;
inner.loaded = true;
Ok(inner)
}
@@ -501,7 +554,6 @@ impl DeltaLayer {
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &DeltaFileName,
predecessor: Option<Arc<dyn Layer>>,
) -> DeltaLayer {
DeltaLayer {
path_or_conf: PathOrConf::Conf(conf),
@@ -513,36 +565,56 @@ impl DeltaLayer {
dropped: filename.dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: false,
page_version_metas: BTreeMap::new(),
relsizes: BTreeMap::new(),
book: None,
page_version_metas: VecMap::default(),
relsizes: VecMap::default(),
}),
predecessor,
}
}
/// Create a DeltaLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'dump_layerfile' binary.
pub fn new_for_path(
path: &Path,
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &DeltaFileName,
) -> DeltaLayer {
DeltaLayer {
pub fn new_for_path<F>(path: &Path, book: &Book<F>) -> Result<Self>
where
F: std::os::unix::prelude::FileExt,
{
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let summary = Summary::des(&chapter)?;
Ok(DeltaLayer {
path_or_conf: PathOrConf::Path(path.to_path_buf()),
timelineid,
tenantid,
seg: filename.seg,
start_lsn: filename.start_lsn,
end_lsn: filename.end_lsn,
dropped: filename.dropped,
timelineid: summary.timelineid,
tenantid: summary.tenantid,
seg: summary.seg,
start_lsn: summary.start_lsn,
end_lsn: summary.end_lsn,
dropped: summary.dropped,
inner: Mutex::new(DeltaLayerInner {
loaded: false,
page_version_metas: BTreeMap::new(),
relsizes: BTreeMap::new(),
book: None,
page_version_metas: VecMap::default(),
relsizes: VecMap::default(),
}),
predecessor: None,
})
}
fn layer_name(&self) -> DeltaFileName {
DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn: self.end_lsn,
dropped: self.dropped,
}
}
/// Path to the layer file in pageserver workdir.
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&self.layer_name(),
)
}
}
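
The DeltaLayer code above follows a lazy-load pattern: the heavyweight state (the open Book and the deserialized maps) lives behind a mutex, load() populates it on first access and checks the on-disk summary, and unload() throws the maps away so a later load() re-reads the file. A minimal sketch of that pattern, using hypothetical names (LazyLayer, Inner) and plain std::fs instead of VirtualFile:

use std::path::PathBuf;
use std::sync::{Mutex, MutexGuard};

struct Inner {
    loaded: bool,
    // In the real DeltaLayer this also holds the open Book and the VecMaps.
    contents: Vec<u8>,
}

struct LazyLayer {
    path: PathBuf,
    inner: Mutex<Inner>,
}

impl LazyLayer {
    // Return the inner state, reading the file on first access only.
    fn load(&self) -> std::io::Result<MutexGuard<'_, Inner>> {
        let mut inner = self.inner.lock().unwrap();
        if !inner.loaded {
            inner.contents = std::fs::read(&self.path)?;
            inner.loaded = true;
        }
        Ok(inner)
    }

    // Drop the in-memory copy; the next load() re-reads the file.
    fn unload(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.contents = Vec::new();
        inner.loaded = false;
    }
}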


@@ -0,0 +1,298 @@
//! Implementation of append-only file data structure
//! used to keep in-memory layers spilled on disk.
use crate::page_cache;
use crate::page_cache::PAGE_SZ;
use crate::page_cache::{ReadBufResult, WriteBufResult};
use crate::virtual_file::VirtualFile;
use crate::PageServerConf;
use lazy_static::lazy_static;
use std::cmp::min;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::{Error, ErrorKind, Seek, SeekFrom, Write};
use std::ops::DerefMut;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use std::os::unix::fs::FileExt;
lazy_static! {
///
/// This is the global cache of file descriptors (File objects).
///
static ref EPHEMERAL_FILES: RwLock<EphemeralFiles> = RwLock::new(EphemeralFiles {
next_file_id: 1,
files: HashMap::new(),
});
}
pub struct EphemeralFiles {
next_file_id: u64,
files: HashMap<u64, Arc<VirtualFile>>,
}
pub struct EphemeralFile {
file_id: u64,
_tenantid: ZTenantId,
_timelineid: ZTimelineId,
file: Arc<VirtualFile>,
pos: u64,
}
impl EphemeralFile {
pub fn create(
conf: &PageServerConf,
tenantid: ZTenantId,
timelineid: ZTimelineId,
) -> Result<EphemeralFile, std::io::Error> {
let mut l = EPHEMERAL_FILES.write().unwrap();
let file_id = l.next_file_id;
l.next_file_id += 1;
let filename = conf
.timeline_path(&timelineid, &tenantid)
.join(PathBuf::from(format!("ephemeral-{}", file_id)));
let file = VirtualFile::open_with_options(
&filename,
OpenOptions::new().read(true).write(true).create(true),
)?;
let file_rc = Arc::new(file);
l.files.insert(file_id, file_rc.clone());
Ok(EphemeralFile {
file_id,
_tenantid: tenantid,
_timelineid: timelineid,
file: file_rc,
pos: 0,
})
}
pub fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
.file
.read_at(&mut buf[off..], blkno as u64 * PAGE_SZ as u64 + off as u64)?;
if n == 0 {
// Reached EOF. Fill the rest of the buffer with zeros.
const ZERO_BUF: [u8; PAGE_SZ] = [0u8; PAGE_SZ];
buf[off..].copy_from_slice(&ZERO_BUF[off..]);
break;
}
off += n as usize;
}
Ok(())
}
}
impl FileExt for EphemeralFile {
fn read_at(&self, dstbuf: &mut [u8], offset: u64) -> Result<usize, Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, dstbuf.len());
let read_guard;
let mut write_guard;
let cache = page_cache::get();
let buf = match cache.read_ephemeral_buf(self.file_id, blkno) {
ReadBufResult::Found(guard) => {
read_guard = guard;
read_guard.as_ref()
}
ReadBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to read the requested slice from the
// buffer.
write_guard.as_ref()
}
};
dstbuf[0..len].copy_from_slice(&buf[off..(off + len)]);
Ok(len)
}
fn write_at(&self, srcbuf: &[u8], offset: u64) -> Result<usize, Error> {
// Look up the right page
let blkno = (offset / PAGE_SZ as u64) as u32;
let off = offset as usize % PAGE_SZ;
let len = min(PAGE_SZ - off, srcbuf.len());
let mut write_guard;
let cache = page_cache::get();
let buf = match cache.write_ephemeral_buf(self.file_id, blkno) {
WriteBufResult::Found(guard) => {
write_guard = guard;
write_guard.deref_mut()
}
WriteBufResult::NotFound(guard) => {
// Read the page from disk into the buffer
// TODO: if we're overwriting the whole page, no need to read it in first
write_guard = guard;
self.fill_buffer(write_guard.deref_mut(), blkno)?;
write_guard.mark_valid();
// And then fall through to modify it.
write_guard.deref_mut()
}
};
buf[off..(off + len)].copy_from_slice(&srcbuf[0..len]);
write_guard.mark_dirty();
Ok(len)
}
}
impl Write for EphemeralFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, Error> {
let n = self.write_at(buf, self.pos)?;
self.pos += n as u64;
Ok(n)
}
fn flush(&mut self) -> Result<(), std::io::Error> {
todo!()
}
}
impl Seek for EphemeralFile {
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match pos {
SeekFrom::Start(offset) => {
self.pos = offset;
}
SeekFrom::End(_offset) => {
return Err(Error::new(
ErrorKind::Other,
"SeekFrom::End not supported by EphemeralFile",
));
}
SeekFrom::Current(offset) => {
let pos = self.pos as i128 + offset as i128;
if pos < 0 {
return Err(Error::new(
ErrorKind::InvalidInput,
"offset would be negative",
));
}
if pos > u64::MAX as i128 {
return Err(Error::new(ErrorKind::InvalidInput, "offset overflow"));
}
self.pos = pos as u64;
}
}
Ok(self.pos)
}
}
impl Drop for EphemeralFile {
fn drop(&mut self) {
// drop all pages from page cache
let cache = page_cache::get();
cache.drop_buffers_for_ephemeral(self.file_id);
// remove entry from the hash map
EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id);
// unlink file
// FIXME: print error
let _ = std::fs::remove_file(&self.file.path);
}
}
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Error> {
if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) {
file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64)?;
Ok(())
} else {
Err(std::io::Error::new(
ErrorKind::Other,
"could not write back page, not found in ephemeral files hash",
))
}
}
#[cfg(test)]
mod tests {
use super::*;
use rand::seq::SliceRandom;
use rand::thread_rng;
use std::fs;
use std::str::FromStr;
fn repo_harness(
test_name: &str,
) -> Result<(&'static PageServerConf, ZTenantId, ZTimelineId), Error> {
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
let conf = PageServerConf::dummy_conf(repo_dir);
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenantid = ZTenantId::from_str("11000000000000000000000000000000").unwrap();
let timelineid = ZTimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&timelineid, &tenantid))?;
Ok((conf, tenantid, timelineid))
}
// Helper function to slurp the contents of a file, starting at the given offset,
// into a string
fn read_string(efile: &EphemeralFile, offset: u64, len: usize) -> Result<String, Error> {
let mut buf = Vec::new();
buf.resize(len, 0u8);
efile.read_exact_at(&mut buf, offset)?;
Ok(String::from_utf8_lossy(&buf)
.trim_end_matches('\0')
.to_string())
}
#[test]
fn test_ephemeral_files() -> Result<(), Error> {
let (conf, tenantid, timelineid) = repo_harness("ephemeral_files")?;
let mut file_a = EphemeralFile::create(conf, tenantid, timelineid)?;
file_a.write_all(b"foo")?;
assert_eq!("foo", read_string(&file_a, 0, 20)?);
file_a.write_all(b"bar")?;
assert_eq!("foobar", read_string(&file_a, 0, 20)?);
// Open a lot of files, enough to cause some page evictions.
let mut efiles = Vec::new();
for fileno in 0..100 {
let mut efile = EphemeralFile::create(conf, tenantid, timelineid)?;
efile.write_all(format!("file {}", fileno).as_bytes())?;
assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?);
efiles.push((fileno, efile));
}
// Check that all the files can still be read from. Use them in random order for
// good measure.
efiles.as_mut_slice().shuffle(&mut thread_rng());
for (fileno, efile) in efiles.iter_mut() {
assert_eq!(format!("file {}", fileno), read_string(efile, 0, 10)?);
}
Ok(())
}
}
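
EphemeralFile::read_at and write_at above never touch the file directly for the requested range; they translate the byte offset into a page number plus an offset within that page, fetch the whole page through the page cache, and clamp the transfer so it never crosses a page boundary. A minimal sketch of just that offset arithmetic, assuming the same 8192-byte PAGE_SZ (the page_split helper is hypothetical):

use std::cmp::min;

const PAGE_SZ: usize = 8192;

// Split a byte offset into (block number, offset within the block) and clamp the
// transfer length so a single call never crosses a page boundary.
fn page_split(offset: u64, want: usize) -> (u32, usize, usize) {
    let blkno = (offset / PAGE_SZ as u64) as u32;
    let off = (offset % PAGE_SZ as u64) as usize;
    let len = min(PAGE_SZ - off, want);
    (blkno, off, len)
}

fn main() {
    // A 100-byte access starting 50 bytes before a page boundary is split:
    // the first call handles 50 bytes of block 0, the next one starts at block 1.
    assert_eq!(page_split(8192 - 50, 100), (0, 8142, 50));
    assert_eq!(page_split(8192, 50), (1, 0, 50));
}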


@@ -13,6 +13,8 @@ use anyhow::Result;
use log::*;
use zenith_utils::lsn::Lsn;
use super::metadata::METADATA_FILE_NAME;
// Note: LayeredTimeline::load_layer_map() relies on this sort order
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub struct DeltaFileName {
@@ -35,7 +37,7 @@ impl DeltaFileName {
/// Parse a string as a delta file name. Returns None if the filename does not
/// match the expected pattern.
///
pub fn from_str(fname: &str) -> Option<Self> {
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
@@ -168,7 +170,7 @@ impl ImageFileName {
/// Parse a string as an image file name. Returns None if the filename does not
/// match the expected pattern.
///
pub fn from_str(fname: &str) -> Option<Self> {
pub fn parse_str(fname: &str) -> Option<Self> {
let rel;
let mut parts;
if let Some(rest) = fname.strip_prefix("rel_") {
@@ -286,11 +288,11 @@ pub fn list_files(
let fname = direntry?.file_name();
let fname = fname.to_str().unwrap();
if let Some(deltafilename) = DeltaFileName::from_str(fname) {
if let Some(deltafilename) = DeltaFileName::parse_str(fname) {
deltafiles.push(deltafilename);
} else if let Some(imgfilename) = ImageFileName::from_str(fname) {
} else if let Some(imgfilename) = ImageFileName::parse_str(fname) {
imgfiles.push(imgfilename);
} else if fname == "wal" || fname == "metadata" || fname == "ancestor" {
} else if fname == METADATA_FILE_NAME || fname.ends_with(".old") {
// ignore these
} else {
warn!("unrecognized filename in timeline dir: {}", fname);


@@ -0,0 +1,142 @@
//!
//! Global registry of open layers.
//!
//! Whenever a new in-memory layer is created to hold incoming WAL, it is registered
//! in [`GLOBAL_LAYER_MAP`], so that we can keep track of the total number of
//! in-memory layers in the system, and know when we need to evict some to release
//! memory.
//!
//! Each layer is assigned a unique ID when it's registered in the global registry.
//! The ID can be used to relocate the layer later, without having to hold locks.
//!
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::{Arc, RwLock};
use super::inmemory_layer::InMemoryLayer;
use lazy_static::lazy_static;
const MAX_USAGE_COUNT: u8 = 5;
lazy_static! {
pub static ref GLOBAL_LAYER_MAP: RwLock<InMemoryLayers> =
RwLock::new(InMemoryLayers::default());
}
// TODO these types can probably be smaller
#[derive(PartialEq, Eq, Clone, Copy)]
pub struct LayerId {
index: usize,
tag: u64, // to avoid ABA problem
}
enum SlotData {
Occupied(Arc<InMemoryLayer>),
/// Vacant slots form a linked list, the value is the index
/// of the next vacant slot in the list.
Vacant(Option<usize>),
}
struct Slot {
tag: u64,
data: SlotData,
usage_count: AtomicU8, // for clock algorithm
}
#[derive(Default)]
pub struct InMemoryLayers {
slots: Vec<Slot>,
num_occupied: usize,
// Head of free-slot list.
next_empty_slot_idx: Option<usize>,
}
impl InMemoryLayers {
pub fn insert(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
let slot_idx = match self.next_empty_slot_idx {
Some(slot_idx) => slot_idx,
None => {
let idx = self.slots.len();
self.slots.push(Slot {
tag: 0,
data: SlotData::Vacant(None),
usage_count: AtomicU8::new(0),
});
idx
}
};
let slots_len = self.slots.len();
let slot = &mut self.slots[slot_idx];
match slot.data {
SlotData::Occupied(_) => {
panic!("an occupied slot was in the free list");
}
SlotData::Vacant(next_empty_slot_idx) => {
self.next_empty_slot_idx = next_empty_slot_idx;
}
}
slot.data = SlotData::Occupied(layer);
slot.usage_count.store(1, Ordering::Relaxed);
self.num_occupied += 1;
assert!(self.num_occupied <= slots_len);
LayerId {
index: slot_idx,
tag: slot.tag,
}
}
pub fn get(&self, layer_id: &LayerId) -> Option<Arc<InMemoryLayer>> {
let slot = self.slots.get(layer_id.index)?; // TODO should out of bounds indexes just panic?
if slot.tag != layer_id.tag {
return None;
}
if let SlotData::Occupied(layer) = &slot.data {
let _ = slot.usage_count.fetch_update(
Ordering::Relaxed,
Ordering::Relaxed,
|old_usage_count| {
if old_usage_count < MAX_USAGE_COUNT {
Some(old_usage_count + 1)
} else {
None
}
},
);
Some(Arc::clone(layer))
} else {
None
}
}
// TODO this won't be a public API in the future
pub fn remove(&mut self, layer_id: &LayerId) {
let slot = &mut self.slots[layer_id.index];
if slot.tag != layer_id.tag {
return;
}
match &slot.data {
SlotData::Occupied(_layer) => {
// TODO evict the layer
}
SlotData::Vacant(_) => unimplemented!(),
}
slot.data = SlotData::Vacant(self.next_empty_slot_idx);
self.next_empty_slot_idx = Some(layer_id.index);
assert!(self.num_occupied > 0);
self.num_occupied -= 1;
slot.tag = slot.tag.wrapping_add(1);
}
}
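
The registry above avoids the ABA problem by pairing each slot index with a tag: remove() bumps the slot's tag, so a LayerId that outlived its layer fails the tag comparison in get() instead of resolving to whatever layer later reused the slot. A minimal sketch of that tag check, with hypothetical Registry/Slot/Id types and without the free list or clock usage counts:

#[derive(Clone, Copy, PartialEq, Eq)]
struct Id {
    index: usize,
    tag: u64,
}

struct Slot<T> {
    tag: u64,
    value: Option<T>,
}

struct Registry<T> {
    slots: Vec<Slot<T>>,
}

impl<T> Registry<T> {
    // Simplified: always append a fresh slot; the real code reuses vacant slots
    // through a free list and also maintains clock-style usage counts.
    fn insert(&mut self, value: T) -> Id {
        let index = self.slots.len();
        self.slots.push(Slot { tag: 0, value: Some(value) });
        Id { index, tag: 0 }
    }

    // A stale Id (its slot was freed and possibly reused, so the tag was bumped)
    // is treated as "not found" instead of aliasing another value.
    fn get(&self, id: Id) -> Option<&T> {
        let slot = self.slots.get(id.index)?;
        if slot.tag != id.tag {
            return None;
        }
        slot.value.as_ref()
    }

    fn remove(&mut self, id: Id) {
        if let Some(slot) = self.slots.get_mut(id.index) {
            if slot.tag == id.tag {
                slot.value = None;
                slot.tag = slot.tag.wrapping_add(1); // invalidates outstanding Ids
            }
        }
    }
}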


@@ -27,29 +27,55 @@ use crate::layered_repository::storage_layer::{
};
use crate::layered_repository::LayeredTimeline;
use crate::layered_repository::RELISH_SEG_SIZE;
use crate::virtual_file::VirtualFile;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{anyhow, ensure, Result};
use anyhow::{anyhow, bail, ensure, Context, Result};
use bytes::Bytes;
use log::*;
use serde::{Deserialize, Serialize};
use std::convert::TryInto;
use std::fs;
use std::fs::File;
use std::io::Write;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};
use std::sync::{Mutex, MutexGuard};
use bookfile::{Book, BookWriter};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
// Magic constant to identify a Zenith segment image file
const IMAGE_FILE_MAGIC: u32 = 0x5A616E01 + 1;
pub const IMAGE_FILE_MAGIC: u32 = 0x5A616E01 + 1;
/// Contains each block in block # order
const BLOCKY_IMAGES_CHAPTER: u64 = 1;
const NONBLOCKY_IMAGE_CHAPTER: u64 = 2;
/// Contains the [`Summary`] struct
const SUMMARY_CHAPTER: u64 = 3;
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq)]
struct Summary {
tenantid: ZTenantId,
timelineid: ZTimelineId,
seg: SegmentTag,
lsn: Lsn,
}
impl From<&ImageLayer> for Summary {
fn from(layer: &ImageLayer) -> Self {
Self {
tenantid: layer.tenantid,
timelineid: layer.timelineid,
seg: layer.seg,
lsn: layer.lsn,
}
}
}
const BLOCK_SIZE: usize = 8192;
///
@@ -78,9 +104,8 @@ enum ImageType {
}
pub struct ImageLayerInner {
/// If false, the 'image_type' has not been
/// loaded into memory yet.
loaded: bool,
/// If None, the 'image_type' has not been loaded into memory yet.
book: Option<Book<VirtualFile>>,
/// Derived from filename and bookfile chapter metadata
image_type: ImageType,
@@ -88,13 +113,11 @@ pub struct ImageLayerInner {
impl Layer for ImageLayer {
fn filename(&self) -> PathBuf {
PathBuf::from(
ImageFileName {
seg: self.seg,
lsn: self.lsn,
}
.to_string(),
)
PathBuf::from(self.layer_name().to_string())
}
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
fn get_timeline_id(&self) -> ZTimelineId {
@@ -114,7 +137,8 @@ impl Layer for ImageLayer {
}
fn get_end_lsn(&self) -> Lsn {
self.lsn
// End-bound is exclusive
self.lsn + 1
}
/// Look up given page in the file
@@ -122,16 +146,20 @@ impl Layer for ImageLayer {
&self,
blknum: u32,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
assert!(lsn >= self.lsn);
match cached_img_lsn {
Some(cached_lsn) if self.lsn <= cached_lsn => return Ok(PageReconstructResult::Cached),
_ => {}
}
let inner = self.load()?;
let base_blknum = blknum % RELISH_SEG_SIZE;
let (_path, book) = self.open_book()?;
let buf = match &inner.image_type {
ImageType::Blocky { num_blocks } => {
if base_blknum >= *num_blocks {
@@ -141,14 +169,23 @@ impl Layer for ImageLayer {
let mut buf = vec![0u8; BLOCK_SIZE];
let offset = BLOCK_SIZE as u64 * base_blknum as u64;
let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
let chapter = inner
.book
.as_ref()
.unwrap()
.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
chapter.read_exact_at(&mut buf, offset)?;
buf
}
ImageType::NonBlocky => {
ensure!(base_blknum == 0);
book.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?.into_vec()
inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?
.into_vec()
}
};
@@ -170,14 +207,7 @@ impl Layer for ImageLayer {
Ok(true)
}
///
/// Release most of the memory used by this layer. If it's accessed again later,
/// it will need to be loaded back.
///
fn unload(&self) -> Result<()> {
let mut inner = self.inner.lock().unwrap();
inner.image_type = ImageType::Blocky { num_blocks: 0 };
inner.loaded = false;
Ok(())
}
@@ -191,11 +221,15 @@ impl Layer for ImageLayer {
false
}
fn is_in_memory(&self) -> bool {
false
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
println!(
"----- image layer for tli {} seg {} at {} ----",
self.timelineid, self.seg, self.lsn
"----- image layer for ten {} tli {} seg {} at {} ----",
self.tenantid, self.timelineid, self.seg, self.lsn
);
let inner = self.load()?;
@@ -203,8 +237,11 @@ impl Layer for ImageLayer {
match inner.image_type {
ImageType::Blocky { num_blocks } => println!("({}) blocks ", num_blocks),
ImageType::NonBlocky => {
let (_path, book) = self.open_book()?;
let chapter = book.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?;
let chapter = inner
.book
.as_ref()
.unwrap()
.read_chapter(NONBLOCKY_IMAGE_CHAPTER)?;
println!("non-blocky ({} bytes)", chapter.len());
}
}
@@ -214,18 +251,6 @@ impl Layer for ImageLayer {
}
impl ImageLayer {
fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
)
}
fn path_for(
path_or_conf: &PathOrConf,
timelineid: ZTimelineId,
@@ -264,19 +289,24 @@ impl ImageLayer {
seg,
lsn,
inner: Mutex::new(ImageLayerInner {
loaded: true,
book: None,
image_type: image_type.clone(),
}),
};
let inner = layer.inner.lock().unwrap();
// Write the images into a file
let path = layer.path();
//
// Note: Because we open the file in write-only mode, we cannot
// reuse the same VirtualFile for reading later. That's why we don't
// set inner.book here. The first read will have to re-open it.
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let file = File::create(&path)?;
let book = BookWriter::new(file, IMAGE_FILE_MAGIC)?;
let path = layer.path();
let file = VirtualFile::create(&path)?;
let buf_writer = BufWriter::new(file);
let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?;
let book = match &image_type {
ImageType::Blocky { .. } => {
@@ -294,9 +324,22 @@ impl ImageLayer {
}
};
book.close()?;
let mut chapter = book.new_chapter(SUMMARY_CHAPTER);
let summary = Summary {
tenantid,
timelineid,
seg,
trace!("saved {}", &path.display());
lsn,
};
Summary::ser_into(&summary, &mut chapter)?;
let book = chapter.close()?;
// This flushes the underlying 'buf_writer'.
let writer = book.close()?;
writer.get_ref().sync_all()?;
trace!("saved {}", path.display());
drop(inner);
@@ -348,11 +391,44 @@ impl ImageLayer {
// quick exit if already loaded
let mut inner = self.inner.lock().unwrap();
if inner.loaded {
if inner.book.is_some() {
return Ok(inner);
}
let (path, book) = self.open_book()?;
let path = self.path();
let file = VirtualFile::open(&path)
.with_context(|| format!("Failed to open virtual file '{}'", path.display()))?;
let book = Book::new(file).with_context(|| {
format!(
"Failed to open virtual file '{}' as a bookfile",
path.display()
)
})?;
match &self.path_or_conf {
PathOrConf::Conf(_) => {
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let actual_summary = Summary::des(&chapter)?;
let expected_summary = Summary::from(self);
if actual_summary != expected_summary {
bail!("in-file summary does not match expected summary. actual = {:?} expected = {:?}", actual_summary, expected_summary);
}
}
PathOrConf::Path(path) => {
let actual_filename = Path::new(path.file_name().unwrap());
let expected_filename = self.filename();
if actual_filename != expected_filename {
println!(
"warning: filename does not match what is expected from in-file summary"
);
println!("actual: {:?}", actual_filename);
println!("expected: {:?}", expected_filename);
}
}
}
let image_type = if self.seg.rel.is_blocky() {
let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?;
@@ -368,30 +444,13 @@ impl ImageLayer {
debug!("loaded from {}", &path.display());
*inner = ImageLayerInner {
loaded: true,
book: Some(book),
image_type,
};
Ok(inner)
}
fn open_book(&self) -> Result<(PathBuf, Book<File>)> {
let path = Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&ImageFileName {
seg: self.seg,
lsn: self.lsn,
},
);
let file = File::open(&path)?;
let book = Book::new(file)?;
Ok((path, book))
}
/// Create an ImageLayer struct representing an existing file on disk
pub fn new(
conf: &'static PageServerConf,
@@ -406,7 +465,7 @@ impl ImageLayer {
seg: filename.seg,
lsn: filename.lsn,
inner: Mutex::new(ImageLayerInner {
loaded: false,
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
}),
}
@@ -415,22 +474,40 @@ impl ImageLayer {
/// Create an ImageLayer struct representing an existing file on disk.
///
/// This variant is only used for debugging purposes, by the 'dump_layerfile' binary.
pub fn new_for_path(
path: &Path,
timelineid: ZTimelineId,
tenantid: ZTenantId,
filename: &ImageFileName,
) -> ImageLayer {
ImageLayer {
pub fn new_for_path<F>(path: &Path, book: &Book<F>) -> Result<ImageLayer>
where
F: std::os::unix::prelude::FileExt,
{
let chapter = book.read_chapter(SUMMARY_CHAPTER)?;
let summary = Summary::des(&chapter)?;
Ok(ImageLayer {
path_or_conf: PathOrConf::Path(path.to_path_buf()),
timelineid,
tenantid,
seg: filename.seg,
lsn: filename.lsn,
timelineid: summary.timelineid,
tenantid: summary.tenantid,
seg: summary.seg,
lsn: summary.lsn,
inner: Mutex::new(ImageLayerInner {
loaded: false,
book: None,
image_type: ImageType::Blocky { num_blocks: 0 },
}),
})
}
fn layer_name(&self) -> ImageFileName {
ImageFileName {
seg: self.seg,
lsn: self.lsn,
}
}
/// Path to the layer file in pageserver workdir.
pub fn path(&self) -> PathBuf {
Self::path_for(
&self.path_or_conf,
self.timelineid,
self.tenantid,
&self.layer_name(),
)
}
}


@@ -1,7 +1,10 @@
//! An in-memory layer stores recently received PageVersions.
//! The page versions are held in a BTreeMap. To avoid OOM errors, the map size is limited
//! and layers can be spilled to disk into ephemeral files.
//!
//! An in-memory layer stores recently received page versions in memory. The page versions
//! are held in a BTreeMap, and there's another BTreeMap to track the size of the relation.
//! And there's another BTreeMap to track the size of the relation.
//!
use crate::layered_repository::ephemeral_file::EphemeralFile;
use crate::layered_repository::filename::DeltaFileName;
use crate::layered_repository::storage_layer::{
Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentTag, RELISH_SEG_SIZE,
@@ -12,16 +15,15 @@ use crate::layered_repository::{DeltaLayer, ImageLayer};
use crate::repository::WALRecord;
use crate::PageServerConf;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, Result};
use anyhow::{ensure, Result};
use bytes::Bytes;
use log::*;
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::ops::Bound::Included;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::sync::{Arc, RwLock};
use zenith_utils::lsn::Lsn;
use zenith_utils::vec_map::VecMap;
use super::page_versions::PageVersions;
pub struct InMemoryLayer {
conf: &'static PageServerConf,
@@ -31,8 +33,7 @@ pub struct InMemoryLayer {
///
/// This layer contains all the changes from 'start_lsn'. The
/// start is inclusive. There is no end LSN; we only use in-memory
/// layer at the end of a timeline.
/// start is inclusive.
///
start_lsn: Lsn,
@@ -41,68 +42,84 @@ pub struct InMemoryLayer {
/// The above fields never change. The parts that do change are in 'inner',
/// and protected by mutex.
inner: Mutex<InMemoryLayerInner>,
inner: RwLock<InMemoryLayerInner>,
/// Predecessor layer
predecessor: Option<Arc<dyn Layer>>,
/// Predecessor layer might be needed?
incremental: bool,
}
pub struct InMemoryLayerInner {
/// Frozen layers have an exclusive end LSN.
/// Writes are only allowed when this is None
end_lsn: Option<Lsn>,
/// If this relation was dropped, remember when that happened.
drop_lsn: Option<Lsn>,
/// The drop LSN is recorded in [`end_lsn`].
dropped: bool,
///
/// All versions of all pages in the layer are kept here.
/// Indexed by block number and LSN.
///
page_versions: BTreeMap<(u32, Lsn), PageVersion>,
page_versions: PageVersions,
///
/// `segsizes` tracks the size of the segment at different points in time.
///
segsizes: BTreeMap<Lsn, u32>,
/// For a blocky rel, there is always one entry, at the layer's start_lsn,
/// so that determining the size never depends on the predecessor layer. For
/// a non-blocky rel, 'segsizes' is not used and is always empty.
///
segsizes: VecMap<Lsn, u32>,
}
impl InMemoryLayerInner {
fn assert_writeable(&self) {
assert!(self.end_lsn.is_none());
}
fn get_seg_size(&self, lsn: Lsn) -> u32 {
// Scan the BTreeMap backwards, starting from the given entry.
let mut iter = self.segsizes.range((Included(&Lsn(0)), Included(&lsn)));
let slice = self.segsizes.slice_range(..=lsn);
if let Some((_entry_lsn, entry)) = iter.next_back() {
// We make sure there is always at least one entry
if let Some((_entry_lsn, entry)) = slice.last() {
*entry
} else {
0
panic!("could not find seg size in in-memory layer");
}
}
}
impl Layer for InMemoryLayer {
// An in-memory layer doesn't really have a filename as it's not stored on disk,
// but we construct a filename as if it was a delta layer
// An in-memory layer can be spilled to disk into an ephemeral file.
// This function is used only for debugging, so we don't need to be very precise.
// Construct a filename as if it was a delta layer.
fn filename(&self) -> PathBuf {
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
let end_lsn;
let dropped;
if let Some(drop_lsn) = inner.drop_lsn {
if let Some(drop_lsn) = inner.end_lsn {
end_lsn = drop_lsn;
dropped = true;
} else {
end_lsn = Lsn(u64::MAX);
dropped = false;
}
let delta_filename = DeltaFileName {
seg: self.seg,
start_lsn: self.start_lsn,
end_lsn,
dropped,
dropped: inner.dropped,
}
.to_string();
PathBuf::from(format!("inmem-{}", delta_filename))
}
fn get_tenant_id(&self) -> ZTenantId {
self.tenantid
}
fn get_timeline_id(&self) -> ZTimelineId {
self.timelineid
}
@@ -116,18 +133,18 @@ impl Layer for InMemoryLayer {
}
fn get_end_lsn(&self) -> Lsn {
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
if let Some(drop_lsn) = inner.drop_lsn {
drop_lsn
if let Some(end_lsn) = inner.end_lsn {
end_lsn
} else {
Lsn(u64::MAX)
}
}
fn is_dropped(&self) -> bool {
let inner = self.inner.lock().unwrap();
inner.drop_lsn.is_some()
let inner = self.inner.read().unwrap();
inner.dropped
}
/// Look up given page in the cache.
@@ -135,55 +152,57 @@ impl Layer for InMemoryLayer {
&self,
blknum: u32,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult> {
let mut cont_lsn: Option<Lsn> = Some(lsn);
let mut need_image = true;
assert!(self.seg.blknum_in_seg(blknum));
{
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
// Scan the BTreeMap backwards, starting from reconstruct_data.lsn.
let minkey = (blknum, Lsn(0));
let maxkey = (blknum, lsn);
let mut iter = inner
// Scan the page versions backwards, starting from `lsn`.
let iter = inner
.page_versions
.range((Included(&minkey), Included(&maxkey)));
while let Some(((_blknum, entry_lsn), entry)) = iter.next_back() {
if let Some(img) = &entry.page_image {
reconstruct_data.page_img = Some(img.clone());
cont_lsn = None;
break;
} else if let Some(rec) = &entry.record {
reconstruct_data.records.push(rec.clone());
if rec.will_init {
// This WAL record initializes the page, so no need to go further back
cont_lsn = None;
break;
} else {
// This WAL record needs to be applied against an older page image
cont_lsn = Some(*entry_lsn);
.get_block_lsn_range(blknum, ..=lsn)
.iter()
.rev();
for (entry_lsn, pos) in iter {
match &cached_img_lsn {
Some(cached_lsn) if entry_lsn <= cached_lsn => {
return Ok(PageReconstructResult::Cached)
}
_ => {}
}
let pv = inner.page_versions.get_page_version(*pos)?;
match pv {
PageVersion::Page(img) => {
reconstruct_data.page_img = Some(img);
need_image = false;
break;
}
PageVersion::Wal(rec) => {
reconstruct_data.records.push((*entry_lsn, rec.clone()));
if rec.will_init {
// This WAL record initializes the page, so no need to go further back
need_image = false;
break;
}
}
} else {
// No base image, and no WAL record. Huh?
bail!("no page image or WAL record for requested page");
}
}
// release lock on 'inner'
}
// If an older page image is needed to reconstruct the page, let the
// caller know about the predecessor layer.
if let Some(cont_lsn) = cont_lsn {
if let Some(cont_layer) = &self.predecessor {
Ok(PageReconstructResult::Continue(
cont_lsn,
Arc::clone(cont_layer),
))
// caller know
if need_image {
if self.incremental {
Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1)))
} else {
Ok(PageReconstructResult::Missing(cont_lsn))
Ok(PageReconstructResult::Missing(self.start_lsn))
}
} else {
Ok(PageReconstructResult::Complete)
@@ -193,19 +212,32 @@ impl Layer for InMemoryLayer {
/// Get size of the relation at given LSN
fn get_seg_size(&self, lsn: Lsn) -> Result<u32> {
assert!(lsn >= self.start_lsn);
ensure!(
self.seg.rel.is_blocky(),
"get_seg_size() called on a non-blocky rel"
);
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
Ok(inner.get_seg_size(lsn))
}
/// Does this segment exist at given LSN?
fn get_seg_exists(&self, lsn: Lsn) -> Result<bool> {
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
// If the segment was created after the requested LSN,
// it doesn't exist in the layer. But we shouldn't
// have requested it in the first place.
assert!(lsn >= self.start_lsn);
// Is the requested LSN after the segment was dropped?
if let Some(drop_lsn) = inner.drop_lsn {
if lsn >= drop_lsn {
return Ok(false);
if inner.dropped {
if let Some(end_lsn) = inner.end_lsn {
if lsn >= end_lsn {
return Ok(false);
}
} else {
panic!("dropped in-memory layer with no end LSN");
}
}
@@ -223,49 +255,55 @@ impl Layer for InMemoryLayer {
/// Nothing to do here. When you drop the last reference to the layer, it will
/// be deallocated.
fn delete(&self) -> Result<()> {
Ok(())
panic!("can't delete an InMemoryLayer")
}
fn is_incremental(&self) -> bool {
self.predecessor.is_some()
self.incremental
}
fn is_in_memory(&self) -> bool {
true
}
/// debugging function to print out the contents of the layer
fn dump(&self) -> Result<()> {
let inner = self.inner.lock().unwrap();
let inner = self.inner.read().unwrap();
let end_str = inner
.drop_lsn
.end_lsn
.as_ref()
.map(|drop_lsn| drop_lsn.to_string())
.map(Lsn::to_string)
.unwrap_or_default();
println!(
"----- in-memory layer for tli {} seg {} {}-{} ----",
self.timelineid, self.seg, self.start_lsn, end_str
"----- in-memory layer for tli {} seg {} {}-{} {} ----",
self.timelineid, self.seg, self.start_lsn, end_str, inner.dropped,
);
for (k, v) in inner.segsizes.iter() {
for (k, v) in inner.segsizes.as_slice() {
println!("segsizes {}: {}", k, v);
}
for (k, v) in inner.page_versions.iter() {
println!(
"blk {} at {}: {}/{}\n",
k.0,
k.1,
v.page_image.is_some(),
v.record.is_some()
);
for (blknum, lsn, pos) in inner.page_versions.ordered_page_version_iter(None) {
let pv = inner.page_versions.get_page_version(pos)?;
let pv_description = match pv {
PageVersion::Page(_img) => "page",
PageVersion::Wal(_rec) => "wal",
};
println!("blk {} at {}: {}\n", blknum, lsn, pv_description);
}
Ok(())
}
}
// Type alias to simplify InMemoryLayer::freeze signature
//
type SuccessorLayers = (Vec<Arc<dyn Layer>>, Option<Arc<InMemoryLayer>>);
/// The result of an in-memory layer's data being written to disk.
pub struct LayersOnDisk {
pub delta_layers: Vec<DeltaLayer>,
pub image_layers: Vec<ImageLayer>,
}
impl InMemoryLayer {
/// Return the oldest page version that's stored in this layer
@@ -291,6 +329,14 @@ impl InMemoryLayer {
start_lsn
);
// The segment is initially empty, so initialize 'segsizes' with 0.
let mut segsizes = VecMap::default();
if seg.rel.is_blocky() {
segsizes.append(start_lsn, 0).unwrap();
}
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
Ok(InMemoryLayer {
conf,
timelineid,
@@ -298,44 +344,31 @@ impl InMemoryLayer {
seg,
start_lsn,
oldest_pending_lsn,
inner: Mutex::new(InMemoryLayerInner {
drop_lsn: None,
page_versions: BTreeMap::new(),
segsizes: BTreeMap::new(),
incremental: false,
inner: RwLock::new(InMemoryLayerInner {
end_lsn: None,
dropped: false,
page_versions: PageVersions::new(file),
segsizes,
}),
predecessor: None,
})
}
// Write operations
/// Remember new page version, as a WAL record over previous version
pub fn put_wal_record(&self, blknum: u32, rec: WALRecord) -> Result<()> {
self.put_page_version(
blknum,
rec.lsn,
PageVersion {
page_image: None,
record: Some(rec),
},
)
pub fn put_wal_record(&self, lsn: Lsn, blknum: u32, rec: WALRecord) -> Result<u32> {
self.put_page_version(blknum, lsn, PageVersion::Wal(rec))
}
/// Remember new page version, as a full page image
pub fn put_page_image(&self, blknum: u32, lsn: Lsn, img: Bytes) -> Result<()> {
self.put_page_version(
blknum,
lsn,
PageVersion {
page_image: Some(img),
record: None,
},
)
pub fn put_page_image(&self, blknum: u32, lsn: Lsn, img: Bytes) -> Result<u32> {
self.put_page_version(blknum, lsn, PageVersion::Page(img))
}
/// Common subroutine of the public put_wal_record() and put_page_image() functions.
/// Adds the page version to the in-memory tree
pub fn put_page_version(&self, blknum: u32, lsn: Lsn, pv: PageVersion) -> Result<()> {
pub fn put_page_version(&self, blknum: u32, lsn: Lsn, pv: PageVersion) -> Result<u32> {
assert!(self.seg.blknum_in_seg(blknum));
trace!(
@@ -345,9 +378,11 @@ impl InMemoryLayer {
self.timelineid,
lsn
);
let mut inner = self.inner.lock().unwrap();
let mut inner = self.inner.write().unwrap();
let old = inner.page_versions.insert((blknum, lsn), pv);
inner.assert_writeable();
let old = inner.page_versions.append_or_update_last(blknum, lsn, pv)?;
if old.is_some() {
// We already had an entry for this LSN. That's odd..
@@ -361,7 +396,7 @@ impl InMemoryLayer {
if self.seg.rel.is_blocky() {
let newsize = blknum - self.seg.segno * RELISH_SEG_SIZE + 1;
// use inner get_seg_size, since calling self.get_seg_size will try to acquire self.inner.lock
// use inner get_seg_size, since calling self.get_seg_size will try to acquire the lock,
// which we've just acquired above
let oldsize = inner.get_seg_size(lsn);
if newsize > oldsize {
@@ -383,15 +418,15 @@ impl InMemoryLayer {
// subsequent call to initialize the gap page.
let gapstart = self.seg.segno * RELISH_SEG_SIZE + oldsize;
for gapblknum in gapstart..blknum {
let zeropv = PageVersion {
page_image: Some(ZERO_PAGE.clone()),
record: None,
};
println!(
let zeropv = PageVersion::Page(ZERO_PAGE.clone());
trace!(
"filling gap blk {} with zeros for write of {}",
gapblknum, blknum
gapblknum,
blknum
);
let old = inner.page_versions.insert((gapblknum, lsn), zeropv);
let old = inner
.page_versions
.append_or_update_last(gapblknum, lsn, zeropv)?;
// We already had an entry for this LSN. That's odd..
if old.is_some() {
@@ -402,36 +437,47 @@ impl InMemoryLayer {
}
}
inner.segsizes.insert(lsn, newsize);
inner.segsizes.append_or_update_last(lsn, newsize).unwrap();
return Ok(newsize - oldsize);
}
}
Ok(())
Ok(0)
}
/// Remember that the relation was truncated at given LSN
pub fn put_truncation(&self, lsn: Lsn, segsize: u32) -> anyhow::Result<()> {
let mut inner = self.inner.lock().unwrap();
let old = inner.segsizes.insert(lsn, segsize);
pub fn put_truncation(&self, lsn: Lsn, segsize: u32) {
assert!(
self.seg.rel.is_blocky(),
"put_truncation() called on a non-blocky rel"
);
let mut inner = self.inner.write().unwrap();
inner.assert_writeable();
// check that we truncate to a smaller size than the segment was before the truncation
let oldsize = inner.get_seg_size(lsn);
assert!(segsize < oldsize);
let (old, _delta_size) = inner.segsizes.append_or_update_last(lsn, segsize).unwrap();
if old.is_some() {
// We already had an entry for this LSN. That's odd..
warn!("Inserting truncation, but had an entry for the LSN already");
}
Ok(())
}
/// Remember that the segment was dropped at given LSN
pub fn drop_segment(&self, lsn: Lsn) -> anyhow::Result<()> {
let mut inner = self.inner.lock().unwrap();
pub fn drop_segment(&self, lsn: Lsn) {
let mut inner = self.inner.write().unwrap();
assert!(inner.drop_lsn.is_none());
inner.drop_lsn = Some(lsn);
assert!(inner.end_lsn.is_none());
assert!(!inner.dropped);
inner.dropped = true;
assert!(self.start_lsn < lsn);
inner.end_lsn = Some(lsn);
info!("dropped segment {} at {}", self.seg, lsn);
Ok(())
trace!("dropped segment {} at {}", self.seg, lsn);
}
///
@@ -448,6 +494,9 @@ impl InMemoryLayer {
) -> Result<InMemoryLayer> {
let seg = src.get_seg_tag();
assert!(oldest_pending_lsn.is_aligned());
assert!(oldest_pending_lsn >= start_lsn);
trace!(
"initializing new InMemoryLayer for writing {} on timeline {} at {}",
seg,
@@ -455,13 +504,15 @@ impl InMemoryLayer {
start_lsn,
);
// For convenience, copy the segment size from the predecessor layer
let mut segsizes = BTreeMap::new();
// Copy the segment size at the start LSN from the predecessor layer.
let mut segsizes = VecMap::default();
if seg.rel.is_blocky() {
let size = src.get_seg_size(start_lsn)?;
segsizes.insert(start_lsn, size);
segsizes.append(start_lsn, size).unwrap();
}
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
Ok(InMemoryLayer {
conf,
timelineid,
@@ -469,100 +520,105 @@ impl InMemoryLayer {
seg,
start_lsn,
oldest_pending_lsn,
inner: Mutex::new(InMemoryLayerInner {
drop_lsn: None,
page_versions: BTreeMap::new(),
incremental: true,
inner: RwLock::new(InMemoryLayerInner {
end_lsn: None,
dropped: false,
page_versions: PageVersions::new(file),
segsizes,
}),
predecessor: Some(src),
})
}
pub fn is_writeable(&self) -> bool {
let inner = self.inner.read().unwrap();
inner.end_lsn.is_none()
}
/// Make the layer non-writeable. Only call once.
/// Records the end_lsn for non-dropped layers.
/// `end_lsn` is inclusive
pub fn freeze(&self, end_lsn: Lsn) {
let mut inner = self.inner.write().unwrap();
if inner.end_lsn.is_some() {
assert!(inner.dropped);
} else {
assert!(!inner.dropped);
assert!(self.start_lsn < end_lsn + 1);
inner.end_lsn = Some(Lsn(end_lsn.0 + 1));
if let Some((lsn, _)) = inner.segsizes.as_slice().last() {
assert!(lsn <= &end_lsn, "{:?} {:?}", lsn, end_lsn);
}
for (_blk, lsn, _pv) in inner.page_versions.ordered_page_version_iter(None) {
assert!(lsn <= end_lsn);
}
}
}
/// Write this frozen in-memory layer to disk.
///
/// Write this in-memory layer to disk.
///
/// The cutoff point for the layer that's written to disk is 'end_lsn'.
///
/// Returns new layers that replace this one. Always returns a new image
/// layer containing the page versions at the cutoff LSN, that were written
/// to disk, and usually also a DeltaLayer that includes all the WAL records
/// between start LSN and the cutoff. (The delta layer is not needed when
/// a new relish is created with a single LSN, so that the start and end LSN
/// are the same.) If there were page versions newer than 'end_lsn', also
/// returns a new in-memory layer containing those page versions. The caller
/// replaces this layer with the returned layers in the layer map.
///
pub fn freeze(
&self,
cutoff_lsn: Lsn,
// This is needed just to call materialize_page()
timeline: &LayeredTimeline,
) -> Result<SuccessorLayers> {
info!(
"freezing in memory layer for {} on timeline {} at {}",
self.seg, self.timelineid, cutoff_lsn
/// Returns new layers that replace this one.
/// If not dropped, returns a new image layer containing the page versions
/// at the `end_lsn`. Can also return a DeltaLayer that includes all the
/// WAL records between start and end LSN. (The delta layer is not needed
/// when a new relish is created with a single LSN, so that the start and
/// end LSN are the same.)
pub fn write_to_disk(&self, timeline: &LayeredTimeline) -> Result<LayersOnDisk> {
trace!(
"write_to_disk {} get_end_lsn is {}",
self.filename().display(),
self.get_end_lsn()
);
let inner = self.inner.lock().unwrap();
// Grab the lock in read-mode. We hold it over the I/O, but because this
// layer is not writeable anymore, no one should be trying to acquire the
// write lock on it, so we shouldn't block anyone. There's one exception
// though: another thread might have grabbed a reference to this layer
// in `get_layer_for_write' just before the checkpointer called
// `freeze`, and then `write_to_disk` on it. When the thread gets the
// lock, it will see that it's not writeable anymore and retry, but it
// would have to wait until we release it. That race condition is very
// rare though, so we just accept the potential latency hit for now.
let inner = self.inner.read().unwrap();
let end_lsn_exclusive = inner.end_lsn.unwrap();
// Normally, use the cutoff LSN as the end of the frozen layer.
// But if the relation was dropped, we know that there are no
// more changes coming in for it, and in particular we know that
// there are no changes "in flight" for the LSN anymore, so we use
// the drop LSN instead. The drop-LSN could be ahead of the
// caller-specified LSN!
let dropped = inner.drop_lsn.is_some();
let end_lsn = if dropped {
inner.drop_lsn.unwrap()
} else {
cutoff_lsn
};
// Divide all the page versions into old and new at the 'end_lsn' cutoff point.
let mut before_page_versions;
let mut before_segsizes;
let mut after_page_versions;
let mut after_segsizes;
if !dropped {
before_segsizes = BTreeMap::new();
after_segsizes = BTreeMap::new();
for (lsn, size) in inner.segsizes.iter() {
if *lsn > end_lsn {
after_segsizes.insert(*lsn, *size);
} else {
before_segsizes.insert(*lsn, *size);
}
}
before_page_versions = BTreeMap::new();
after_page_versions = BTreeMap::new();
for ((blknum, lsn), pv) in inner.page_versions.iter() {
match lsn.cmp(&end_lsn) {
Ordering::Less => {
before_page_versions.insert((*blknum, *lsn), pv.clone());
}
Ordering::Equal => {
// Page versions at the cutoff LSN will be stored in the
// materialized image layer.
}
Ordering::Greater => {
after_page_versions.insert((*blknum, *lsn), pv.clone());
}
}
}
} else {
before_page_versions = inner.page_versions.clone();
before_segsizes = inner.segsizes.clone();
after_segsizes = BTreeMap::new();
after_page_versions = BTreeMap::new();
if inner.dropped {
let delta_layer = DeltaLayer::create(
self.conf,
self.timelineid,
self.tenantid,
self.seg,
self.start_lsn,
end_lsn_exclusive,
true,
&inner.page_versions,
None,
inner.segsizes.clone(),
)?;
trace!(
"freeze: created delta layer for dropped segment {} {}-{}",
self.seg,
self.start_lsn,
end_lsn_exclusive
);
return Ok(LayersOnDisk {
delta_layers: vec![delta_layer],
image_layers: Vec::new(),
});
}
// we can release the lock now.
drop(inner);
// `end_lsn` here is exclusive, so subtract 1 to get the last included LSN.
// We want to make an ImageLayer for the last included LSN,
// so the DeltaLayer should exclude that LSN.
let end_lsn_inclusive = Lsn(end_lsn_exclusive.0 - 1);
let mut frozen_layers: Vec<Arc<dyn Layer>> = Vec::new();
let mut delta_layers = Vec::new();
if self.start_lsn != end_lsn {
if self.start_lsn != end_lsn_inclusive {
let (segsizes, _) = inner.segsizes.split_at(&end_lsn_exclusive);
// Write the page versions before the cutoff to disk.
let delta_layer = DeltaLayer::create(
self.conf,
@@ -570,53 +626,41 @@ impl InMemoryLayer {
self.tenantid,
self.seg,
self.start_lsn,
end_lsn,
dropped,
self.predecessor.clone(),
before_page_versions,
before_segsizes,
end_lsn_inclusive,
false,
&inner.page_versions,
Some(end_lsn_inclusive),
segsizes,
)?;
let delta_layer_rc: Arc<dyn Layer> = Arc::new(delta_layer);
frozen_layers.push(delta_layer_rc);
delta_layers.push(delta_layer);
trace!(
"freeze: created delta layer {} {}-{}",
self.seg,
self.start_lsn,
end_lsn
end_lsn_inclusive
);
} else {
assert!(before_page_versions.is_empty());
assert!(inner
.page_versions
.ordered_page_version_iter(None)
.next()
.is_none());
}
let mut new_open_rc = None;
if !dropped {
// Write a new base image layer at the cutoff point
let imgfile = ImageLayer::create_from_src(self.conf, timeline, self, end_lsn)?;
let imgfile_rc: Arc<dyn Layer> = Arc::new(imgfile);
frozen_layers.push(Arc::clone(&imgfile_rc));
trace!("freeze: created image layer {} at {}", self.seg, end_lsn);
drop(inner);
// If there were any page versions newer than the cutoff, initialize a new in-memory
// layer to hold them
if !after_segsizes.is_empty() || !after_page_versions.is_empty() {
let new_open = Self::create_successor_layer(
self.conf,
imgfile_rc,
self.timelineid,
self.tenantid,
end_lsn,
end_lsn,
)?;
let mut new_inner = new_open.inner.lock().unwrap();
new_inner.page_versions.append(&mut after_page_versions);
new_inner.segsizes.append(&mut after_segsizes);
drop(new_inner);
trace!("freeze: created new in-mem layer {} {}-", self.seg, end_lsn);
// Write a new base image layer at the cutoff point
let image_layer =
ImageLayer::create_from_src(self.conf, timeline, self, end_lsn_inclusive)?;
trace!(
"freeze: created image layer {} at {}",
self.seg,
end_lsn_inclusive
);
new_open_rc = Some(Arc::new(new_open))
}
}
Ok((frozen_layers, new_open_rc))
Ok(LayersOnDisk {
delta_layers,
image_layers: vec![image_layer],
})
}
}
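
Note how freeze() and write_to_disk() above convert between inclusive and exclusive end LSNs: freeze(end_lsn) stores end_lsn + 1 (the first LSN not covered), and write_to_disk() subtracts 1 again so the image layer is created at the last included LSN while the delta layer ends just before it. A tiny sketch of that round trip, assuming a bare Lsn newtype:

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

// freeze(end_lsn) records the first LSN that is *not* in the layer.
fn frozen_end(end_lsn_inclusive: Lsn) -> Lsn {
    Lsn(end_lsn_inclusive.0 + 1)
}

// write_to_disk() turns that back into the last LSN that *is* in the layer,
// which is where the image layer gets created.
fn image_lsn(end_lsn_exclusive: Lsn) -> Lsn {
    Lsn(end_lsn_exclusive.0 - 1)
}

fn main() {
    let cutoff = Lsn(0x42);
    assert_eq!(image_lsn(frozen_end(cutoff)), cutoff);
}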


@@ -0,0 +1,468 @@
///
/// IntervalTree is a data structure for holding intervals. It is generic
/// to make unit testing possible, but the only real user of it is the layer map.
///
/// It's inspired by the "segment tree" or a "statistic tree" as described in
/// https://en.wikipedia.org/wiki/Segment_tree. However, we use a B-tree to hold
/// the points instead of a binary tree. This is called an "interval tree" instead
/// of "segment tree" because the term "segment" is already used in Zenith to mean
/// something else. To add to the confusion, there is another data structure known
/// as "interval tree" out there (see https://en.wikipedia.org/wiki/Interval_tree),
/// for storing intervals, but this isn't that.
///
/// The basic idea is to have a B-tree of "interesting Points". At each Point,
/// there is a list of intervals that contain the point. The Points are formed
/// from the start bounds of each interval; there is a Point for each distinct
/// start bound.
///
/// Operations:
///
/// To find intervals that contain a given point, you search the b-tree to find
/// the nearest Point <= search key. Then you just return the list of intervals.
///
/// To insert an interval, find the Point whose key equals the new interval's start bound.
/// If the Point doesn't exist yet, create it by copying all the items from the
/// previous Point that cover the new Point. Then walk right, inserting the new
/// interval into all the Points that are contained by the new interval (including the
/// newly created Point).
///
/// To remove an interval, you scan the tree for all the Points that are contained by
/// the removed interval, and remove it from the list in each Point.
///
/// Requirements and assumptions:
///
/// - Can store overlapping items
/// - But there are not many overlapping items
/// - The interval bounds don't change after it is added to the tree
/// - Intervals are uniquely identified by pointer equality. You must not insert the
/// same interval object twice, and `remove` uses pointer equality to remove the right
/// interval. It is OK to have two intervals with the same bounds, however.
///
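// Illustrative usage sketch (not part of this module): how `search` behaves for
// a toy item type with u32 keys. The real layer map keys are LSNs; `TestIv` is a
// hypothetical type introduced only for this example.
#[cfg(test)]
#[allow(dead_code)]
fn interval_tree_usage_sketch() {
    use std::sync::Arc;

    struct TestIv(u32, u32);
    impl IntervalItem for TestIv {
        type Key = u32;
        fn start_key(&self) -> u32 {
            self.0
        }
        fn end_key(&self) -> u32 {
            self.1
        }
    }

    let mut tree: IntervalTree<TestIv> = IntervalTree::default();
    tree.insert(Arc::new(TestIv(10, 20)));
    tree.insert(Arc::new(TestIv(15, 30)));

    // The nearest Point <= 25 is 15; of the intervals stored at that Point,
    // the one with the highest end key (15..30) is returned.
    let hit = tree.search(25).unwrap();
    assert_eq!(hit.bounds(), 15..30);
}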
use std::collections::BTreeMap;
use std::fmt::Debug;
use std::ops::Range;
use std::sync::Arc;
pub struct IntervalTree<I: ?Sized>
where
I: IntervalItem,
{
points: BTreeMap<I::Key, Point<I>>,
}
struct Point<I: ?Sized> {
/// All intervals that contain this point, in no particular order.
///
/// We assume that there aren't a lot of overlapping intervals, so that this vector
/// never grows very large. If that assumption doesn't hold, we could keep this ordered
/// by the end bound, to speed up `search`. But as long as there are only a few elements,
/// a linear search is OK.
elements: Vec<Arc<I>>,
}
/// Abstraction for an interval that can be stored in the tree
///
/// The start bound is inclusive and the end bound is exclusive. End must be greater
/// than start.
pub trait IntervalItem {
type Key: Ord + Copy + Debug + Sized;
fn start_key(&self) -> Self::Key;
fn end_key(&self) -> Self::Key;
fn bounds(&self) -> Range<Self::Key> {
self.start_key()..self.end_key()
}
}
impl<I: ?Sized> IntervalTree<I>
where
I: IntervalItem,
{
/// Return an element that contains 'key', or precedes it.
///
/// If there are multiple candidates, returns the one with the highest 'end' key.
pub fn search(&self, key: I::Key) -> Option<Arc<I>> {
// Find the greatest point that precedes or is equal to the search key. If there is
// none, returns None.
let (_, p) = self.points.range(..=key).next_back()?;
// Find the element with the highest end key at this point
let highest_item = p
.elements
.iter()
.reduce(|a, b| {
// starting with Rust 1.53, could use `std::cmp::max_by_key` here
if a.end_key() > b.end_key() {
a
} else {
b
}
})
.unwrap();
Some(Arc::clone(highest_item))
}
/// Iterate over all items with start bound >= 'key'
pub fn iter_newer(&self, key: I::Key) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(key..),
elem_iter: None,
}
}
/// Iterate over all items
pub fn iter(&self) -> IntervalIter<I> {
IntervalIter {
point_iter: self.points.range(..),
elem_iter: None,
}
}
pub fn insert(&mut self, item: Arc<I>) {
let start_key = item.start_key();
let end_key = item.end_key();
assert!(start_key < end_key);
let bounds = start_key..end_key;
// Find the starting point and walk forward from there
let mut found_start_point = false;
let iter = self.points.range_mut(bounds);
for (point_key, point) in iter {
if *point_key == start_key {
found_start_point = true;
// It is an error to insert the same item to the tree twice.
assert!(
!point.elements.iter().any(|x| Arc::ptr_eq(x, &item)),
"interval is already in the tree"
);
}
point.elements.push(Arc::clone(&item));
}
if !found_start_point {
// Create a new Point for the starting point
// Look at the previous point, and copy over elements that overlap with this
// new point
let mut new_elements: Vec<Arc<I>> = Vec::new();
if let Some((_, prev_point)) = self.points.range(..start_key).next_back() {
let overlapping_prev_elements = prev_point
.elements
.iter()
.filter(|x| x.bounds().contains(&start_key))
.cloned();
new_elements.extend(overlapping_prev_elements);
}
new_elements.push(item);
let new_point = Point {
elements: new_elements,
};
self.points.insert(start_key, new_point);
}
}
pub fn remove(&mut self, item: &Arc<I>) {
// range search points
let start_key = item.start_key();
let end_key = item.end_key();
let bounds = start_key..end_key;
let mut points_to_remove: Vec<I::Key> = Vec::new();
let mut found_start_point = false;
for (point_key, point) in self.points.range_mut(bounds) {
if *point_key == start_key {
found_start_point = true;
}
let len_before = point.elements.len();
point.elements.retain(|other| !Arc::ptr_eq(other, item));
let len_after = point.elements.len();
assert_eq!(len_after + 1, len_before);
if len_after == 0 {
points_to_remove.push(*point_key);
}
}
assert!(found_start_point);
for k in points_to_remove {
self.points.remove(&k).unwrap();
}
}
}
pub struct IntervalIter<'a, I: ?Sized>
where
I: IntervalItem,
{
point_iter: std::collections::btree_map::Range<'a, I::Key, Point<I>>,
elem_iter: Option<(I::Key, std::slice::Iter<'a, Arc<I>>)>,
}
impl<'a, I> Iterator for IntervalIter<'a, I>
where
I: IntervalItem + ?Sized,
{
type Item = Arc<I>;
fn next(&mut self) -> Option<Self::Item> {
// Iterate over all elements in all the points in 'point_iter'. To avoid
// returning the same element twice, we only return each element at its
// starting point.
loop {
// Return next remaining element from the current point
if let Some((point_key, elem_iter)) = &mut self.elem_iter {
for elem in elem_iter {
if elem.start_key() == *point_key {
return Some(Arc::clone(elem));
}
}
}
// No more elements at this point. Move to next point.
if let Some((point_key, point)) = self.point_iter.next() {
self.elem_iter = Some((*point_key, point.elements.iter()));
continue;
} else {
// No more points, all done
return None;
}
}
}
}
impl<I: ?Sized> Default for IntervalTree<I>
where
I: IntervalItem,
{
fn default() -> Self {
IntervalTree {
points: BTreeMap::new(),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::fmt;
#[derive(Debug)]
struct MockItem {
start_key: u32,
end_key: u32,
val: String,
}
impl IntervalItem for MockItem {
type Key = u32;
fn start_key(&self) -> u32 {
self.start_key
}
fn end_key(&self) -> u32 {
self.end_key
}
}
impl MockItem {
fn new(start_key: u32, end_key: u32) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}", start_key, end_key),
}
}
fn new_str(start_key: u32, end_key: u32, val: &str) -> Self {
MockItem {
start_key,
end_key,
val: format!("{}-{}: {}", start_key, end_key, val),
}
}
}
impl fmt::Display for MockItem {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.val)
}
}
#[rustfmt::skip]
fn assert_search(
tree: &IntervalTree<MockItem>,
key: u32,
expected: &[&str],
) -> Option<Arc<MockItem>> {
if let Some(v) = tree.search(key) {
let vstr = v.to_string();
assert!(!expected.is_empty(), "search with {} returned {}, expected None", key, v);
assert!(
expected.contains(&vstr.as_str()),
"search with {} returned {}, expected one of: {:?}",
key, v, expected,
);
Some(v)
} else {
assert!(
expected.is_empty(),
"search with {} returned None, expected one of {:?}",
key, expected
);
None
}
}
fn assert_contents(tree: &IntervalTree<MockItem>, expected: &[&str]) {
let mut contents: Vec<String> = tree.iter().map(|e| e.to_string()).collect();
contents.sort();
assert_eq!(contents, expected);
}
fn dump_tree(tree: &IntervalTree<MockItem>) {
for (point_key, point) in tree.points.iter() {
print!("{}:", point_key);
for e in point.elements.iter() {
print!(" {}", e);
}
println!();
}
}
#[test]
fn test_interval_tree_simple() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Simple, non-overlapping ranges.
tree.insert(Arc::new(MockItem::new(10, 11)));
tree.insert(Arc::new(MockItem::new(11, 12)));
tree.insert(Arc::new(MockItem::new(12, 13)));
tree.insert(Arc::new(MockItem::new(18, 19)));
tree.insert(Arc::new(MockItem::new(17, 18)));
tree.insert(Arc::new(MockItem::new(15, 16)));
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &["10-11"]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["12-13"]);
assert_search(&tree, 13, &["12-13"]);
assert_search(&tree, 14, &["12-13"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 16, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["18-19"]);
assert_search(&tree, 19, &["18-19"]);
assert_search(&tree, 20, &["18-19"]);
// remove a few entries and search around them again
tree.remove(&assert_search(&tree, 10, &["10-11"]).unwrap()); // first entry
tree.remove(&assert_search(&tree, 12, &["12-13"]).unwrap()); // entry in the middle
tree.remove(&assert_search(&tree, 18, &["18-19"]).unwrap()); // last entry
assert_search(&tree, 9, &[]);
assert_search(&tree, 10, &[]);
assert_search(&tree, 11, &["11-12"]);
assert_search(&tree, 12, &["11-12"]);
assert_search(&tree, 14, &["11-12"]);
assert_search(&tree, 15, &["15-16"]);
assert_search(&tree, 17, &["17-18"]);
assert_search(&tree, 18, &["17-18"]);
}
#[test]
fn test_interval_tree_overlap() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Overlapping items
tree.insert(Arc::new(MockItem::new(22, 24)));
tree.insert(Arc::new(MockItem::new(23, 25)));
let x24_26 = Arc::new(MockItem::new(24, 26));
tree.insert(Arc::clone(&x24_26));
let x26_28 = Arc::new(MockItem::new(26, 28));
tree.insert(Arc::clone(&x26_28));
tree.insert(Arc::new(MockItem::new(25, 27)));
assert_search(&tree, 22, &["22-24"]);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25", "24-26"]);
assert_search(&tree, 25, &["24-26", "25-27"]);
assert_search(&tree, 26, &["25-27", "26-28"]);
assert_search(&tree, 27, &["26-28"]);
assert_search(&tree, 28, &["26-28"]);
assert_search(&tree, 29, &["26-28"]);
tree.remove(&x24_26);
tree.remove(&x26_28);
assert_search(&tree, 23, &["22-24", "23-25"]);
assert_search(&tree, 24, &["23-25"]);
assert_search(&tree, 25, &["25-27"]);
assert_search(&tree, 26, &["25-27"]);
assert_search(&tree, 27, &["25-27"]);
assert_search(&tree, 28, &["25-27"]);
assert_search(&tree, 29, &["25-27"]);
}
#[test]
fn test_interval_tree_nested() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Items containing other items
tree.insert(Arc::new(MockItem::new(31, 39)));
tree.insert(Arc::new(MockItem::new(32, 34)));
tree.insert(Arc::new(MockItem::new(33, 35)));
tree.insert(Arc::new(MockItem::new(30, 40)));
assert_search(&tree, 30, &["30-40"]);
assert_search(&tree, 31, &["30-40", "31-39"]);
assert_search(&tree, 32, &["30-40", "32-34", "31-39"]);
assert_search(&tree, 33, &["30-40", "32-34", "33-35", "31-39"]);
assert_search(&tree, 34, &["30-40", "33-35", "31-39"]);
assert_search(&tree, 35, &["30-40", "31-39"]);
assert_search(&tree, 36, &["30-40", "31-39"]);
assert_search(&tree, 37, &["30-40", "31-39"]);
assert_search(&tree, 38, &["30-40", "31-39"]);
assert_search(&tree, 39, &["30-40"]);
assert_search(&tree, 40, &["30-40"]);
assert_search(&tree, 41, &["30-40"]);
}
#[test]
fn test_interval_tree_duplicates() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Duplicate keys
let item_a = Arc::new(MockItem::new_str(55, 56, "a"));
tree.insert(Arc::clone(&item_a));
let item_b = Arc::new(MockItem::new_str(55, 56, "b"));
tree.insert(Arc::clone(&item_b));
let item_c = Arc::new(MockItem::new_str(55, 56, "c"));
tree.insert(Arc::clone(&item_c));
let item_d = Arc::new(MockItem::new_str(54, 56, "d"));
tree.insert(Arc::clone(&item_d));
let item_e = Arc::new(MockItem::new_str(55, 57, "e"));
tree.insert(Arc::clone(&item_e));
dump_tree(&tree);
assert_search(
&tree,
55,
&["55-56: a", "55-56: b", "55-56: c", "54-56: d", "55-57: e"],
);
tree.remove(&item_b);
dump_tree(&tree);
assert_contents(&tree, &["54-56: d", "55-56: a", "55-56: c", "55-57: e"]);
tree.remove(&item_d);
dump_tree(&tree);
assert_contents(&tree, &["55-56: a", "55-56: c", "55-57: e"]);
}
#[test]
#[should_panic]
fn test_interval_tree_insert_twice() {
let mut tree: IntervalTree<MockItem> = IntervalTree::default();
// Inserting the same item twice is not cool
let item = Arc::new(MockItem::new(1, 2));
tree.insert(Arc::clone(&item));
tree.insert(Arc::clone(&item)); // fails assertion
}
}

View File

@@ -9,20 +9,20 @@
//! new image and delta layers and corresponding files are written to disk.
//!
use crate::layered_repository::interval_tree::{IntervalItem, IntervalIter, IntervalTree};
use crate::layered_repository::storage_layer::{Layer, SegmentTag};
use crate::layered_repository::InMemoryLayer;
use crate::relish::*;
use anyhow::Result;
use lazy_static::lazy_static;
use log::*;
use std::cmp::Ordering;
use std::collections::HashSet;
use std::collections::{BTreeMap, BinaryHeap, HashMap};
use std::ops::Bound::Included;
use std::collections::{BinaryHeap, HashMap};
use std::sync::Arc;
use zenith_metrics::{register_int_gauge, IntGauge};
use zenith_utils::lsn::Lsn;
use super::global_layer_map::{LayerId, GLOBAL_LAYER_MAP};
lazy_static! {
static ref NUM_INMEMORY_LAYERS: IntGauge =
register_int_gauge!("pageserver_inmemory_layers", "Number of layers in memory")
@@ -35,68 +35,21 @@ lazy_static! {
///
/// LayerMap tracks what layers exist on a timeline.
///
#[derive(Default)]
pub struct LayerMap {
/// All the layers keyed by segment tag
segs: HashMap<SegmentTag, SegEntry>,
/// All in-memory layers, ordered by 'oldest_pending_lsn' of each layer.
/// This allows easy access to the in-memory layer that contains the
/// oldest WAL record.
open_segs: BinaryHeap<OpenSegEntry>,
/// All in-memory layers, ordered by 'oldest_pending_lsn' and generation
/// of each layer. This allows easy access to the in-memory layer that
/// contains the oldest WAL record.
open_layers: BinaryHeap<OpenLayerEntry>,
/// Generation number, used to distinguish newly inserted entries in the
/// binary heap from older entries during checkpoint.
current_generation: u64,
}
///
/// Per-segment entry in the LayerMap.segs hash map
///
/// The last layer that is open for writes is always an InMemoryLayer,
/// and is kept in a separate field, because there can be only one for
/// each segment. The older layers, stored on disk, are kept in a
/// BTreeMap keyed by the layer's start LSN.
struct SegEntry {
pub open: Option<Arc<InMemoryLayer>>,
pub historic: BTreeMap<Lsn, Arc<dyn Layer>>,
}
/// Entry held LayerMap.open_segs, with boilerplate comparison
/// routines to implement a min-heap ordered by 'oldest_pending_lsn'
///
/// Each entry also carries a generation number. It can be used to distinguish
/// entries with the same 'oldest_pending_lsn'.
struct OpenSegEntry {
pub oldest_pending_lsn: Lsn,
pub layer: Arc<InMemoryLayer>,
pub generation: u64,
}
impl Ord for OpenSegEntry {
fn cmp(&self, other: &Self) -> Ordering {
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that.
other.oldest_pending_lsn.cmp(&self.oldest_pending_lsn)
}
}
impl PartialOrd for OpenSegEntry {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that. Entries with identical oldest_pending_lsn are ordered by generation
Some(
other
.oldest_pending_lsn
.cmp(&self.oldest_pending_lsn)
.then_with(|| other.generation.cmp(&self.generation)),
)
}
}
impl PartialEq for OpenSegEntry {
fn eq(&self, other: &Self) -> bool {
self.oldest_pending_lsn.eq(&other.oldest_pending_lsn)
}
}
impl Eq for OpenSegEntry {}
impl LayerMap {
///
/// Look up a layer using the given segment tag and LSN. This differs from a
@@ -107,23 +60,7 @@ impl LayerMap {
pub fn get(&self, tag: &SegmentTag, lsn: Lsn) -> Option<Arc<dyn Layer>> {
let segentry = self.segs.get(tag)?;
if let Some(open) = &segentry.open {
if open.get_start_lsn() <= lsn {
let x: Arc<dyn Layer> = Arc::clone(open) as _;
return Some(x);
}
}
if let Some((_k, v)) = segentry
.historic
.range((Included(Lsn(0)), Included(lsn)))
.next_back()
{
let x: Arc<dyn Layer> = Arc::clone(v) as _;
Some(x)
} else {
None
}
segentry.get(lsn)
}
///
@@ -132,67 +69,76 @@ impl LayerMap {
///
pub fn get_open(&self, tag: &SegmentTag) -> Option<Arc<InMemoryLayer>> {
let segentry = self.segs.get(tag)?;
segentry.open.as_ref().map(Arc::clone)
segentry
.open_layer_id
.and_then(|layer_id| GLOBAL_LAYER_MAP.read().unwrap().get(&layer_id))
}
///
/// Insert an open in-memory layer
///
pub fn insert_open(&mut self, layer: Arc<InMemoryLayer>) {
let tag = layer.get_seg_tag();
let segentry = self.segs.entry(layer.get_seg_tag()).or_default();
if let Some(segentry) = self.segs.get_mut(&tag) {
if let Some(_old) = &segentry.open {
// FIXME: shouldn't exist, but check
}
segentry.open = Some(Arc::clone(&layer));
} else {
let segentry = SegEntry {
open: Some(Arc::clone(&layer)),
historic: BTreeMap::new(),
};
self.segs.insert(tag, segentry);
}
let layer_id = segentry.update_open(Arc::clone(&layer));
let opensegentry = OpenSegEntry {
let oldest_pending_lsn = layer.get_oldest_pending_lsn();
// After a crash and restart, 'oldest_pending_lsn' of the oldest in-memory
// layer becomes the WAL streaming starting point, so it better not point
// in the middle of a WAL record.
assert!(oldest_pending_lsn.is_aligned());
// Also add it to the binary heap
let open_layer_entry = OpenLayerEntry {
oldest_pending_lsn: layer.get_oldest_pending_lsn(),
layer,
layer_id,
generation: self.current_generation,
};
self.open_segs.push(opensegentry);
self.open_layers.push(open_layer_entry);
NUM_INMEMORY_LAYERS.inc();
}
/// Remove the oldest in-memory layer
pub fn pop_oldest_open(&mut self) {
let opensegentry = self.open_segs.pop().unwrap();
let segtag = opensegentry.layer.get_seg_tag();
/// Remove an open in-memory layer
pub fn remove_open(&mut self, layer_id: LayerId) {
// Note: we don't try to remove the entry from the binary heap.
// It will be removed lazily by peek_oldest_open() when it makes it to
// the top of the heap.
let mut segentry = self.segs.get_mut(&segtag).unwrap();
segentry.open = None;
NUM_INMEMORY_LAYERS.dec();
let layer_opt = {
let mut global_map = GLOBAL_LAYER_MAP.write().unwrap();
let layer_opt = global_map.get(&layer_id);
global_map.remove(&layer_id);
// TODO it's bad that a ref can still exist after being evicted from cache
layer_opt
};
if let Some(layer) = layer_opt {
let mut segentry = self.segs.get_mut(&layer.get_seg_tag()).unwrap();
if segentry.open_layer_id == Some(layer_id) {
// Also remove it from the SegEntry of this segment
segentry.open_layer_id = None;
} else {
// We could have already updated segentry.open for
// dropped (non-writeable) layer. This is fine.
assert!(!layer.is_writeable());
assert!(layer.is_dropped());
}
NUM_INMEMORY_LAYERS.dec();
}
}
///
/// Insert an on-disk layer
///
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
let tag = layer.get_seg_tag();
let start_lsn = layer.get_start_lsn();
let segentry = self.segs.entry(layer.get_seg_tag()).or_default();
segentry.insert_historic(layer);
if let Some(segentry) = self.segs.get_mut(&tag) {
segentry.historic.insert(start_lsn, layer);
} else {
let mut historic = BTreeMap::new();
historic.insert(start_lsn, layer);
let segentry = SegEntry {
open: None,
historic,
};
self.segs.insert(tag, segentry);
}
NUM_ONDISK_LAYERS.inc();
}
@@ -201,62 +147,40 @@ impl LayerMap {
///
/// This should be called when the corresponding file on disk has been deleted.
///
pub fn remove_historic(&mut self, layer: &dyn Layer) {
pub fn remove_historic(&mut self, layer: Arc<dyn Layer>) {
let tag = layer.get_seg_tag();
let start_lsn = layer.get_start_lsn();
if let Some(segentry) = self.segs.get_mut(&tag) {
segentry.historic.remove(&start_lsn);
segentry.historic.remove(&layer);
}
NUM_ONDISK_LAYERS.dec();
}
// List relations that exist at the lsn
pub fn list_rels(&self, spcnode: u32, dbnode: u32, lsn: Lsn) -> Result<HashSet<RelTag>> {
let mut rels: HashSet<RelTag> = HashSet::new();
// List relations along with a flag that marks if they exist at the given lsn.
// spcnode 0 and dbnode 0 have special meanings and mean all tablespaces/databases.
// Pass a tag if we're only interested in some relations.
pub fn list_relishes(&self, tag: Option<RelTag>, lsn: Lsn) -> Result<HashMap<RelishTag, bool>> {
let mut rels: HashMap<RelishTag, bool> = HashMap::new();
for (seg, segentry) in self.segs.iter() {
if let RelishTag::Relation(reltag) = seg.rel {
if (spcnode == 0 || reltag.spcnode == spcnode)
&& (dbnode == 0 || reltag.dbnode == dbnode)
{
// Add only if it exists at the requested LSN.
if let Some(open) = &segentry.open {
if open.get_end_lsn() > lsn {
rels.insert(reltag);
match seg.rel {
RelishTag::Relation(reltag) => {
if let Some(request_rel) = tag {
if (request_rel.spcnode == 0 || reltag.spcnode == request_rel.spcnode)
&& (request_rel.dbnode == 0 || reltag.dbnode == request_rel.dbnode)
{
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
} else if let Some((_k, _v)) = segentry
.historic
.range((Included(Lsn(0)), Included(lsn)))
.next_back()
{
rels.insert(reltag);
}
}
}
}
Ok(rels)
}
// List non-relation relishes that exist at the lsn
pub fn list_nonrels(&self, lsn: Lsn) -> Result<HashSet<RelishTag>> {
let mut rels: HashSet<RelishTag> = HashSet::new();
// Scan the timeline directory to get all rels in this timeline.
for (seg, segentry) in self.segs.iter() {
if let RelishTag::Relation(_) = seg.rel {
} else {
// Add only if it exists at the requested LSN.
if let Some(open) = &segentry.open {
if open.get_end_lsn() > lsn {
rels.insert(seg.rel);
_ => {
if tag == None {
if let Some(exists) = segentry.exists_at_lsn(lsn)? {
rels.insert(seg.rel, exists);
}
}
} else if let Some((_k, _v)) = segentry
.historic
.range((Included(Lsn(0)), Included(lsn)))
.next_back()
{
rels.insert(seg.rel);
}
}
}
@@ -269,42 +193,38 @@ impl LayerMap {
/// be deleted.
pub fn newer_image_layer_exists(&self, seg: SegmentTag, lsn: Lsn) -> bool {
if let Some(segentry) = self.segs.get(&seg) {
// We only check on-disk layers, because
// in-memory layers are not durable
for (newer_lsn, layer) in segentry
.historic
.range((Included(lsn), Included(Lsn(u64::MAX))))
{
// Ignore layers that depend on an older layer.
if layer.is_incremental() {
continue;
}
if layer.get_end_lsn() > lsn {
trace!(
"found later layer for {}, {} {}-{}",
seg,
lsn,
newer_lsn,
layer.get_end_lsn()
);
return true;
} else {
trace!("found singleton layer for {}, {} {}", seg, lsn, newer_lsn);
continue;
}
}
segentry.newer_image_layer_exists(lsn)
} else {
false
}
trace!("no later layer found for {}, {}", seg, lsn);
false
}
/// Is there any layer for given segment that is alive at the lsn?
///
/// This is a public wrapper for a SegEntry function,
/// used for garbage collection, to determine if some alive layer
/// exists at the lsn. If so, we shouldn't delete a newer dropped layer
/// to avoid incorrectly making it visible.
pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result<bool> {
Ok(if let Some(segentry) = self.segs.get(&seg) {
segentry.exists_at_lsn(lsn)?.unwrap_or(false)
} else {
false
})
}
/// Return the oldest in-memory layer, along with its generation number.
pub fn peek_oldest_open(&self) -> Option<(Arc<InMemoryLayer>, u64)> {
if let Some(opensegentry) = self.open_segs.peek() {
Some((Arc::clone(&opensegentry.layer), opensegentry.generation))
} else {
None
pub fn peek_oldest_open(&mut self) -> Option<(LayerId, Arc<InMemoryLayer>, u64)> {
let global_map = GLOBAL_LAYER_MAP.read().unwrap();
while let Some(oldest_entry) = self.open_layers.peek() {
if let Some(layer) = global_map.get(&oldest_entry.layer_id) {
return Some((oldest_entry.layer_id, layer, oldest_entry.generation));
} else {
self.open_layers.pop();
}
}
None
}
/// Increment the generation number used to stamp open in-memory layers. Layers
@@ -317,7 +237,7 @@ impl LayerMap {
pub fn iter_historic_layers(&self) -> HistoricLayerIter {
HistoricLayerIter {
segiter: self.segs.iter(),
seg_iter: self.segs.iter(),
iter: None,
}
}
@@ -327,11 +247,15 @@ impl LayerMap {
pub fn dump(&self) -> Result<()> {
println!("Begin dump LayerMap");
for (seg, segentry) in self.segs.iter() {
if let Some(open) = &segentry.open {
open.dump()?;
if let Some(open) = &segentry.open_layer_id {
if let Some(layer) = GLOBAL_LAYER_MAP.read().unwrap().get(open) {
layer.dump()?;
} else {
println!("layer not found in global map");
}
}
for (_, layer) in segentry.historic.iter() {
for layer in segentry.historic.iter() {
layer.dump()?;
}
}
@@ -340,19 +264,119 @@ impl LayerMap {
}
}
impl Default for LayerMap {
fn default() -> Self {
LayerMap {
segs: HashMap::new(),
open_segs: BinaryHeap::new(),
current_generation: 0,
}
impl IntervalItem for dyn Layer {
type Key = Lsn;
fn start_key(&self) -> Lsn {
self.get_start_lsn()
}
fn end_key(&self) -> Lsn {
self.get_end_lsn()
}
}
///
/// Per-segment entry in the LayerMap::segs hash map. Holds all the layers
/// associated with the segment.
///
/// The last layer that is open for writes is always an InMemoryLayer,
/// and is kept in a separate field, because there can be only one for
/// each segment. The older layers, stored on disk, are kept in an
/// IntervalTree.
#[derive(Default)]
struct SegEntry {
open_layer_id: Option<LayerId>,
historic: IntervalTree<dyn Layer>,
}
impl SegEntry {
/// Does the segment exist at given LSN?
/// Return None if object is not found in this SegEntry.
fn exists_at_lsn(&self, lsn: Lsn) -> Result<Option<bool>> {
if let Some(layer) = self.get(lsn) {
Ok(Some(layer.get_seg_exists(lsn)?))
} else {
Ok(None)
}
}
pub fn get(&self, lsn: Lsn) -> Option<Arc<dyn Layer>> {
if let Some(open_layer_id) = &self.open_layer_id {
let open_layer = GLOBAL_LAYER_MAP.read().unwrap().get(open_layer_id)?;
if open_layer.get_start_lsn() <= lsn {
return Some(open_layer);
}
}
self.historic.search(lsn)
}
pub fn newer_image_layer_exists(&self, lsn: Lsn) -> bool {
// We only check on-disk layers, because
// in-memory layers are not durable
self.historic
.iter_newer(lsn)
.any(|layer| !layer.is_incremental())
}
// Set a new open layer for a SegEntry.
// It's OK to overwrite the previous open layer,
// but only if it is not writeable anymore.
pub fn update_open(&mut self, layer: Arc<InMemoryLayer>) -> LayerId {
if let Some(prev_open_layer_id) = &self.open_layer_id {
if let Some(prev_open_layer) = GLOBAL_LAYER_MAP.read().unwrap().get(prev_open_layer_id)
{
assert!(!prev_open_layer.is_writeable());
}
}
let open_layer_id = GLOBAL_LAYER_MAP.write().unwrap().insert(layer);
self.open_layer_id = Some(open_layer_id);
open_layer_id
}
pub fn insert_historic(&mut self, layer: Arc<dyn Layer>) {
self.historic.insert(layer);
}
}
/// Entry held in LayerMap::open_layers, with boilerplate comparison routines
/// to implement a min-heap ordered by 'oldest_pending_lsn' and 'generation'
///
/// The generation number associated with each entry can be used to distinguish
/// recently-added entries (i.e after last call to increment_generation()) from older
/// entries with the same 'oldest_pending_lsn'.
struct OpenLayerEntry {
oldest_pending_lsn: Lsn, // copy of layer.get_oldest_pending_lsn()
generation: u64,
layer_id: LayerId,
}
impl Ord for OpenLayerEntry {
fn cmp(&self, other: &Self) -> Ordering {
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that. Entries with identical oldest_pending_lsn are ordered by generation
other
.oldest_pending_lsn
.cmp(&self.oldest_pending_lsn)
.then_with(|| other.generation.cmp(&self.generation))
}
}
impl PartialOrd for OpenLayerEntry {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for OpenLayerEntry {
fn eq(&self, other: &Self) -> bool {
self.cmp(other) == Ordering::Equal
}
}
impl Eq for OpenLayerEntry {}
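// Illustrative sketch (not part of the layer map): the same "max-heap turned
// min-heap" trick on a plain wrapper type. Because `cmp` is reversed, `peek`
// always yields the entry with the smallest key, which is what
// peek_oldest_open() relies on to find the oldest 'oldest_pending_lsn'.
#[cfg(test)]
mod min_heap_sketch {
    use std::cmp::Ordering;
    use std::collections::BinaryHeap;

    struct MinEntry(u64);

    impl Ord for MinEntry {
        fn cmp(&self, other: &Self) -> Ordering {
            // reversed on purpose: the smallest value wins
            other.0.cmp(&self.0)
        }
    }
    impl PartialOrd for MinEntry {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }
    impl PartialEq for MinEntry {
        fn eq(&self, other: &Self) -> bool {
            self.cmp(other) == Ordering::Equal
        }
    }
    impl Eq for MinEntry {}

    #[test]
    fn peek_returns_smallest() {
        let mut heap = BinaryHeap::new();
        heap.push(MinEntry(0x200));
        heap.push(MinEntry(0x100));
        heap.push(MinEntry(0x120));
        assert_eq!(heap.peek().unwrap().0, 0x100);
    }
}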
/// Iterator returned by LayerMap::iter_historic_layers()
pub struct HistoricLayerIter<'a> {
segiter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<std::collections::btree_map::Iter<'a, Lsn, Arc<dyn Layer>>>,
seg_iter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>,
iter: Option<IntervalIter<'a, dyn Layer>>,
}
impl<'a> Iterator for HistoricLayerIter<'a> {
@@ -362,11 +386,11 @@ impl<'a> Iterator for HistoricLayerIter<'a> {
loop {
if let Some(x) = &mut self.iter {
if let Some(x) = x.next() {
return Some(Arc::clone(&*x.1));
return Some(Arc::clone(&x));
}
}
if let Some(seg) = self.segiter.next() {
self.iter = Some(seg.1.historic.iter());
if let Some((_tag, segentry)) = self.seg_iter.next() {
self.iter = Some(segentry.historic.iter());
continue;
} else {
return None;
@@ -374,3 +398,86 @@ impl<'a> Iterator for HistoricLayerIter<'a> {
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::PageServerConf;
use std::str::FromStr;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
/// Arbitrary relation tag, for testing.
const TESTREL_A: RelishTag = RelishTag::Relation(RelTag {
spcnode: 0,
dbnode: 111,
relnode: 1000,
forknum: 0,
});
lazy_static! {
static ref DUMMY_TIMELINEID: ZTimelineId =
ZTimelineId::from_str("00000000000000000000000000000000").unwrap();
static ref DUMMY_TENANTID: ZTenantId =
ZTenantId::from_str("00000000000000000000000000000000").unwrap();
}
/// Construct a dummy InMemoryLayer for testing
fn dummy_inmem_layer(
conf: &'static PageServerConf,
segno: u32,
start_lsn: Lsn,
oldest_pending_lsn: Lsn,
) -> Arc<InMemoryLayer> {
Arc::new(
InMemoryLayer::create(
conf,
*DUMMY_TIMELINEID,
*DUMMY_TENANTID,
SegmentTag {
rel: TESTREL_A,
segno,
},
start_lsn,
oldest_pending_lsn,
)
.unwrap(),
)
}
#[test]
fn test_open_layers() -> Result<()> {
let conf = PageServerConf::dummy_conf(PageServerConf::test_repo_dir("dummy_inmem_layer"));
let conf = Box::leak(Box::new(conf));
std::fs::create_dir_all(conf.timeline_path(&DUMMY_TIMELINEID, &DUMMY_TENANTID))?;
let mut layers = LayerMap::default();
let gen1 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 0, Lsn(0x100), Lsn(0x100)));
layers.insert_open(dummy_inmem_layer(conf, 1, Lsn(0x100), Lsn(0x200)));
layers.insert_open(dummy_inmem_layer(conf, 2, Lsn(0x100), Lsn(0x120)));
layers.insert_open(dummy_inmem_layer(conf, 3, Lsn(0x100), Lsn(0x110)));
let gen2 = layers.increment_generation();
layers.insert_open(dummy_inmem_layer(conf, 4, Lsn(0x100), Lsn(0x110)));
layers.insert_open(dummy_inmem_layer(conf, 5, Lsn(0x100), Lsn(0x100)));
// A helper function (closure) to pop the next oldest open entry from the layer map,
// and assert that it is what we'd expect
let mut assert_pop_layer = |expected_segno: u32, expected_generation: u64| {
let (layer_id, l, generation) = layers.peek_oldest_open().unwrap();
assert!(l.get_seg_tag().segno == expected_segno);
assert!(generation == expected_generation);
layers.remove_open(layer_id);
};
assert_pop_layer(0, gen1); // 0x100
assert_pop_layer(5, gen2); // 0x100
assert_pop_layer(3, gen1); // 0x110
assert_pop_layer(4, gen2); // 0x110
assert_pop_layer(2, gen1); // 0x120
assert_pop_layer(1, gen1); // 0x200
Ok(())
}
}

View File

@@ -0,0 +1,226 @@
//! Every image of a certain timeline from [`crate::layered_repository::LayeredRepository`]
//! has metadata that needs to be stored persistently.
//!
//! Later, the file is used in [`crate::remote_storage::storage_sync`] as part of
//! external storage import and export operations.
//!
//! The module contains all structs and helper methods related to timeline metadata.
use std::{convert::TryInto, path::PathBuf};
use anyhow::ensure;
use zenith_utils::{
bin_ser::BeSer,
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
use crate::{
layered_repository::{METADATA_CHECKSUM_SIZE, METADATA_MAX_DATA_SIZE, METADATA_MAX_SAFE_SIZE},
PageServerConf,
};
/// The name of the metadata file pageserver creates per timeline.
pub const METADATA_FILE_NAME: &str = "metadata";
/// Metadata stored on disk for each timeline
///
/// The fields correspond to the values we hold in memory, in LayeredTimeline.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct TimelineMetadata {
disk_consistent_lsn: Lsn,
// This is only set if we know it. We track it in memory when the page
// server is running, but we only track the value corresponding to
// 'last_record_lsn', not 'disk_consistent_lsn' which can lag behind by a
// lot. We only store it in the metadata file when we flush *all* the
// in-memory data so that 'last_record_lsn' is the same as
// 'disk_consistent_lsn'. That's OK, because after page server restart, as
// soon as we reprocess at least one record, we will have a valid
// 'prev_record_lsn' value in memory again. This is only really needed when
// doing a clean shutdown, so that there is no more WAL beyond
// 'disk_consistent_lsn'
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
}
/// Points to a place in pageserver's local directory,
/// where certain timeline's metadata file should be located.
pub fn metadata_path(
conf: &'static PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> PathBuf {
conf.timeline_path(&timelineid, &tenantid)
.join(METADATA_FILE_NAME)
}
impl TimelineMetadata {
pub fn new(
disk_consistent_lsn: Lsn,
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
) -> Self {
Self {
disk_consistent_lsn,
prev_record_lsn,
ancestor_timeline,
ancestor_lsn,
latest_gc_cutoff_lsn,
initdb_lsn,
}
}
pub fn from_bytes(metadata_bytes: &[u8]) -> anyhow::Result<Self> {
ensure!(
metadata_bytes.len() == METADATA_MAX_SAFE_SIZE,
"metadata bytes size is wrong"
);
let data = &metadata_bytes[..METADATA_MAX_DATA_SIZE];
let calculated_checksum = crc32c::crc32c(data);
let checksum_bytes: &[u8; METADATA_CHECKSUM_SIZE] =
metadata_bytes[METADATA_MAX_DATA_SIZE..].try_into()?;
let expected_checksum = u32::from_le_bytes(*checksum_bytes);
ensure!(
calculated_checksum == expected_checksum,
"metadata checksum mismatch"
);
let data = TimelineMetadata::from(serialize::DeTimelineMetadata::des_prefix(data)?);
assert!(data.disk_consistent_lsn.is_aligned());
Ok(data)
}
pub fn to_bytes(&self) -> anyhow::Result<Vec<u8>> {
let serializeable_metadata = serialize::SeTimelineMetadata::from(self);
let mut metadata_bytes = serialize::SeTimelineMetadata::ser(&serializeable_metadata)?;
assert!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE);
metadata_bytes.resize(METADATA_MAX_SAFE_SIZE, 0u8);
let checksum = crc32c::crc32c(&metadata_bytes[..METADATA_MAX_DATA_SIZE]);
metadata_bytes[METADATA_MAX_DATA_SIZE..].copy_from_slice(&u32::to_le_bytes(checksum));
Ok(metadata_bytes)
}
/// [`Lsn`] that corresponds to the corresponding timeline directory
/// contents, stored locally in the pageserver workdir.
pub fn disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
}
pub fn prev_record_lsn(&self) -> Option<Lsn> {
self.prev_record_lsn
}
pub fn ancestor_timeline(&self) -> Option<ZTimelineId> {
self.ancestor_timeline
}
pub fn ancestor_lsn(&self) -> Lsn {
self.ancestor_lsn
}
pub fn latest_gc_cutoff_lsn(&self) -> Lsn {
self.latest_gc_cutoff_lsn
}
pub fn initdb_lsn(&self) -> Lsn {
self.initdb_lsn
}
}
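// Illustrative sketch (not part of this module): the on-disk layout that
// `to_bytes` and `from_bytes` agree on. The serialized struct is zero-padded up
// to METADATA_MAX_DATA_SIZE, and the trailing bytes hold a little-endian crc32c
// of the data part (4 bytes, as implied by `u32::from_le_bytes` above), so that
// METADATA_MAX_SAFE_SIZE = METADATA_MAX_DATA_SIZE + METADATA_CHECKSUM_SIZE.
#[cfg(test)]
#[allow(dead_code)]
fn checksum_layout_sketch(metadata_bytes: &[u8]) -> bool {
    assert_eq!(metadata_bytes.len(), METADATA_MAX_SAFE_SIZE);
    let data = &metadata_bytes[..METADATA_MAX_DATA_SIZE];
    let stored = u32::from_le_bytes(metadata_bytes[METADATA_MAX_DATA_SIZE..].try_into().unwrap());
    crc32c::crc32c(data) == stored
}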
/// This module is for direct conversion of metadata to bytes and back.
/// Besides the conversion itself, a few verification steps have to be performed,
/// so all serde derives are hidden from the user, to avoid accidental
/// verification-less metadata creation.
mod serialize {
use serde::{Deserialize, Serialize};
use zenith_utils::{lsn::Lsn, zid::ZTimelineId};
use super::TimelineMetadata;
#[derive(Serialize)]
pub(super) struct SeTimelineMetadata<'a> {
disk_consistent_lsn: &'a Lsn,
prev_record_lsn: &'a Option<Lsn>,
ancestor_timeline: &'a Option<ZTimelineId>,
ancestor_lsn: &'a Lsn,
latest_gc_cutoff_lsn: &'a Lsn,
initdb_lsn: &'a Lsn,
}
impl<'a> From<&'a TimelineMetadata> for SeTimelineMetadata<'a> {
fn from(other: &'a TimelineMetadata) -> Self {
Self {
disk_consistent_lsn: &other.disk_consistent_lsn,
prev_record_lsn: &other.prev_record_lsn,
ancestor_timeline: &other.ancestor_timeline,
ancestor_lsn: &other.ancestor_lsn,
latest_gc_cutoff_lsn: &other.latest_gc_cutoff_lsn,
initdb_lsn: &other.initdb_lsn,
}
}
}
#[derive(Deserialize)]
pub(super) struct DeTimelineMetadata {
disk_consistent_lsn: Lsn,
prev_record_lsn: Option<Lsn>,
ancestor_timeline: Option<ZTimelineId>,
ancestor_lsn: Lsn,
latest_gc_cutoff_lsn: Lsn,
initdb_lsn: Lsn,
}
impl From<DeTimelineMetadata> for TimelineMetadata {
fn from(other: DeTimelineMetadata) -> Self {
Self {
disk_consistent_lsn: other.disk_consistent_lsn,
prev_record_lsn: other.prev_record_lsn,
ancestor_timeline: other.ancestor_timeline,
ancestor_lsn: other.ancestor_lsn,
latest_gc_cutoff_lsn: other.latest_gc_cutoff_lsn,
initdb_lsn: other.initdb_lsn,
}
}
}
}
#[cfg(test)]
mod tests {
use crate::repository::repo_harness::TIMELINE_ID;
use super::*;
#[test]
fn metadata_serializes_correctly() {
let original_metadata = TimelineMetadata {
disk_consistent_lsn: Lsn(0x200),
prev_record_lsn: Some(Lsn(0x100)),
ancestor_timeline: Some(TIMELINE_ID),
ancestor_lsn: Lsn(0),
latest_gc_cutoff_lsn: Lsn(0),
initdb_lsn: Lsn(0),
};
let metadata_bytes = original_metadata
.to_bytes()
.expect("Should serialize correct metadata to bytes");
let deserialized_metadata = TimelineMetadata::from_bytes(&metadata_bytes)
.expect("Should deserialize its own bytes");
assert_eq!(
deserialized_metadata, original_metadata,
"Metadata that was serialized to bytes and deserialized back should not change"
);
}
}

View File

@@ -0,0 +1,252 @@
//!
//! Data structure to ingest incoming WAL into an append-only file.
//!
//! - The file is considered temporary, and will be discarded on crash
//! - based on a B-tree
//!
use std::os::unix::fs::FileExt;
use std::{collections::HashMap, ops::RangeBounds, slice};
use anyhow::Result;
use std::cmp::min;
use std::io::Seek;
use zenith_utils::{lsn::Lsn, vec_map::VecMap};
use super::storage_layer::PageVersion;
use crate::layered_repository::ephemeral_file::EphemeralFile;
use zenith_utils::bin_ser::BeSer;
const EMPTY_SLICE: &[(Lsn, u64)] = &[];
pub struct PageVersions {
map: HashMap<u32, VecMap<Lsn, u64>>,
/// The PageVersion structs are stored in a serialized format in this file.
/// Each serialized PageVersion is preceded by a 'u32' length field.
/// The 'map' stores offsets into this file.
file: EphemeralFile,
}
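// Illustrative sketch (not part of this module): the length-prefixed record
// format used in the ephemeral file. Each PageVersion is written as a
// native-endian u32 length followed by the serialized payload; `map` only
// remembers the starting offset of each record.
#[cfg(test)]
#[allow(dead_code)]
fn encode_record_sketch(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + payload.len());
    // 'length' field, written with the same endianness as append_or_update_last()
    buf.extend_from_slice(&u32::to_ne_bytes(payload.len() as u32));
    // serialized PageVersion bytes
    buf.extend_from_slice(payload);
    buf
}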
impl PageVersions {
pub fn new(file: EphemeralFile) -> PageVersions {
PageVersions {
map: HashMap::new(),
file,
}
}
pub fn append_or_update_last(
&mut self,
blknum: u32,
lsn: Lsn,
page_version: PageVersion,
) -> Result<Option<u64>> {
// remember starting position
let pos = self.file.stream_position()?;
// make room for the 'length' field by writing zeros as a placeholder.
self.file.seek(std::io::SeekFrom::Start(pos + 4)).unwrap();
page_version.ser_into(&mut self.file).unwrap();
// write the 'length' field.
let len = self.file.stream_position()? - pos - 4;
let lenbuf = u32::to_ne_bytes(len as u32);
self.file.write_all_at(&lenbuf, pos)?;
let map = self.map.entry(blknum).or_insert_with(VecMap::default);
Ok(map.append_or_update_last(lsn, pos as u64).unwrap().0)
}
/// Get all [`PageVersion`]s in a block
fn get_block_slice(&self, blknum: u32) -> &[(Lsn, u64)] {
self.map
.get(&blknum)
.map(VecMap::as_slice)
.unwrap_or(EMPTY_SLICE)
}
/// Get a range of [`PageVersions`] in a block
pub fn get_block_lsn_range<R: RangeBounds<Lsn>>(&self, blknum: u32, range: R) -> &[(Lsn, u64)] {
self.map
.get(&blknum)
.map(|vec_map| vec_map.slice_range(range))
.unwrap_or(EMPTY_SLICE)
}
/// Iterate through [`PageVersion`]s in (block, lsn) order.
/// If a [`cutoff_lsn`] is set, only show versions with `lsn < cutoff_lsn`
pub fn ordered_page_version_iter(&self, cutoff_lsn: Option<Lsn>) -> OrderedPageVersionIter<'_> {
let mut ordered_blocks: Vec<u32> = self.map.keys().cloned().collect();
ordered_blocks.sort_unstable();
let slice = ordered_blocks
.first()
.map(|&blknum| self.get_block_slice(blknum))
.unwrap_or(EMPTY_SLICE);
OrderedPageVersionIter {
page_versions: self,
ordered_blocks,
cur_block_idx: 0,
cutoff_lsn,
cur_slice_iter: slice.iter(),
}
}
/// Returns a 'Read' that reads the page version at given offset.
pub fn reader(&self, pos: u64) -> Result<PageVersionReader, std::io::Error> {
// read length
let mut lenbuf = [0u8; 4];
self.file.read_exact_at(&mut lenbuf, pos)?;
let len = u32::from_ne_bytes(lenbuf);
Ok(PageVersionReader {
file: &self.file,
pos: pos + 4,
end_pos: pos + 4 + len as u64,
})
}
pub fn get_page_version(&self, pos: u64) -> Result<PageVersion> {
let mut reader = self.reader(pos)?;
Ok(PageVersion::des_from(&mut reader)?)
}
}
pub struct PageVersionReader<'a> {
file: &'a EphemeralFile,
pos: u64,
end_pos: u64,
}
impl<'a> std::io::Read for PageVersionReader<'a> {
fn read(&mut self, buf: &mut [u8]) -> Result<usize, std::io::Error> {
let len = min(buf.len(), (self.end_pos - self.pos) as usize);
let n = self.file.read_at(&mut buf[..len], self.pos)?;
self.pos += n as u64;
Ok(n)
}
}
pub struct OrderedPageVersionIter<'a> {
page_versions: &'a PageVersions,
ordered_blocks: Vec<u32>,
cur_block_idx: usize,
cutoff_lsn: Option<Lsn>,
cur_slice_iter: slice::Iter<'a, (Lsn, u64)>,
}
impl OrderedPageVersionIter<'_> {
fn is_lsn_before_cutoff(&self, lsn: &Lsn) -> bool {
if let Some(cutoff_lsn) = self.cutoff_lsn.as_ref() {
lsn < cutoff_lsn
} else {
true
}
}
}
impl<'a> Iterator for OrderedPageVersionIter<'a> {
type Item = (u32, Lsn, u64);
fn next(&mut self) -> Option<Self::Item> {
loop {
if let Some((lsn, pos)) = self.cur_slice_iter.next() {
if self.is_lsn_before_cutoff(lsn) {
let blknum = self.ordered_blocks[self.cur_block_idx];
return Some((blknum, *lsn, *pos));
}
}
let next_block_idx = self.cur_block_idx + 1;
let blknum: u32 = *self.ordered_blocks.get(next_block_idx)?;
self.cur_block_idx = next_block_idx;
self.cur_slice_iter = self.page_versions.get_block_slice(blknum).iter();
}
}
}
#[cfg(test)]
mod tests {
use bytes::Bytes;
use super::*;
use crate::PageServerConf;
use std::fs;
use std::str::FromStr;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
fn repo_harness(test_name: &str) -> Result<(&'static PageServerConf, ZTenantId, ZTimelineId)> {
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
let conf = PageServerConf::dummy_conf(repo_dir);
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenantid = ZTenantId::from_str("11000000000000000000000000000000").unwrap();
let timelineid = ZTimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&timelineid, &tenantid))?;
Ok((conf, tenantid, timelineid))
}
#[test]
fn test_ordered_iter() -> Result<()> {
let (conf, tenantid, timelineid) = repo_harness("test_ordered_iter")?;
let file = EphemeralFile::create(conf, tenantid, timelineid)?;
let mut page_versions = PageVersions::new(file);
const BLOCKS: u32 = 1000;
const LSNS: u64 = 50;
let empty_page = Bytes::from_static(&[0u8; 8192]);
let empty_page_version = PageVersion::Page(empty_page);
for blknum in 0..BLOCKS {
for lsn in 0..LSNS {
let old = page_versions.append_or_update_last(
blknum,
Lsn(lsn),
empty_page_version.clone(),
)?;
assert!(old.is_none());
}
}
let mut iter = page_versions.ordered_page_version_iter(None);
for blknum in 0..BLOCKS {
for lsn in 0..LSNS {
let (actual_blknum, actual_lsn, _pv) = iter.next().unwrap();
assert_eq!(actual_blknum, blknum);
assert_eq!(Lsn(lsn), actual_lsn);
}
}
assert!(iter.next().is_none());
assert!(iter.next().is_none()); // should be robust against excessive next() calls
const CUTOFF_LSN: Lsn = Lsn(30);
let mut iter = page_versions.ordered_page_version_iter(Some(CUTOFF_LSN));
for blknum in 0..BLOCKS {
for lsn in 0..CUTOFF_LSN.0 {
let (actual_blknum, actual_lsn, _pv) = iter.next().unwrap();
assert_eq!(actual_blknum, blknum);
assert_eq!(Lsn(lsn), actual_lsn);
}
}
assert!(iter.next().is_none());
assert!(iter.next().is_none()); // should be robust against excessive next() calls
Ok(())
}
}

View File

@@ -4,13 +4,12 @@
use crate::relish::RelishTag;
use crate::repository::WALRecord;
use crate::ZTimelineId;
use crate::{ZTenantId, ZTimelineId};
use anyhow::Result;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use std::fmt;
use std::path::PathBuf;
use std::sync::Arc;
use zenith_utils::lsn::Lsn;
@@ -21,7 +20,7 @@ pub const RELISH_SEG_SIZE: u32 = 10 * 1024 * 1024 / 8192;
/// Each relish stored in the repository is divided into fixed-sized "segments",
/// with 10 MB of key-space, or 1280 8k pages each.
///
#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy)]
#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy, Serialize, Deserialize)]
pub struct SegmentTag {
pub rel: RelishTag,
pub segno: u32,
@@ -52,23 +51,10 @@ impl SegmentTag {
///
/// A page version can be stored as a full page image, or as WAL record that needs
/// to be applied over the previous page version to reconstruct this version.
///
/// It's also possible to have both a WAL record and a page image in the same
/// PageVersion. That happens if page version is originally stored as a WAL record
/// but it is later reconstructed by a GetPage@LSN request by performing WAL
/// redo. The get_page_at_lsn() code will store the reconstructed pag image next to
/// the WAL record in that case. TODO: That's pretty accidental, not the result
/// of any grand design. If we want to keep reconstructed page versions around, we
/// probably should have a separate buffer cache so that we could control the
/// replacement policy globally. Or if we keep a reconstructed page image, we
/// could throw away the WAL record.
///
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PageVersion {
/// an 8kb page image
pub page_image: Option<Bytes>,
/// WAL record to get from previous page version to this one.
pub record: Option<WALRecord>,
pub enum PageVersion {
Page(Bytes),
Wal(WALRecord),
}
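// Illustrative sketch (not part of this module): with the old struct of two
// Options replaced by an enum, consumers can match exhaustively and the
// impossible "neither image nor record" case no longer needs handling.
#[cfg(test)]
#[allow(dead_code)]
fn describe_page_version(pv: &PageVersion) -> &'static str {
    match pv {
        PageVersion::Page(_) => "full page image",
        PageVersion::Wal(_) => "WAL record to apply over the previous version",
    }
}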
///
@@ -79,7 +65,7 @@ pub struct PageVersion {
/// 'records' contains the records to apply over the base image.
///
pub struct PageReconstructData {
pub records: Vec<WALRecord>,
pub records: Vec<(Lsn, WALRecord)>,
pub page_img: Option<Bytes>,
}
@@ -87,13 +73,15 @@ pub struct PageReconstructData {
pub enum PageReconstructResult {
/// Got all the data needed to reconstruct the requested page
Complete,
/// This layer didn't contain all the required data, the caller should collect
/// more data from the returned predecessor layer at the returned LSN.
Continue(Lsn, Arc<dyn Layer>),
/// This layer didn't contain all the required data, the caller should look up
/// the predecessor layer at the returned LSN and collect more data from there.
Continue(Lsn),
/// This layer didn't contain data needed to reconstruct the page version at
/// the returned LSN. This is usually considered an error, but might be OK
/// in some circumstances.
Missing(Lsn),
/// Use the cached image at `cached_img_lsn` as the base image
Cached,
}
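// Illustrative sketch (not part of this trait): one way a caller could drive the
// reconstruction protocol described below. `lookup_layer` is a hypothetical
// closure standing in for the timeline's layer-map lookup; the real call sites
// live in the layered repository.
#[cfg(test)]
#[allow(dead_code)]
fn drive_reconstruction(
    mut layer: Arc<dyn Layer>,
    blknum: u32,
    mut lsn: Lsn,
    cached_img_lsn: Option<Lsn>,
    lookup_layer: impl Fn(Lsn) -> Option<Arc<dyn Layer>>,
    data: &mut PageReconstructData,
) -> Result<()> {
    loop {
        match layer.get_page_reconstruct_data(blknum, lsn, cached_img_lsn, data)? {
            // All data collected (or the cached image can serve as the base).
            PageReconstructResult::Complete | PageReconstructResult::Cached => return Ok(()),
            PageReconstructResult::Continue(cont_lsn) => {
                // Not enough data in this layer: continue from its predecessor.
                layer = lookup_layer(cont_lsn)
                    .ok_or_else(|| anyhow::anyhow!("no layer found at {}", cont_lsn))?;
                lsn = cont_lsn;
            }
            PageReconstructResult::Missing(missing_lsn) => {
                anyhow::bail!("base image not found at {}", missing_lsn)
            }
        }
    }
}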
///
@@ -105,21 +93,22 @@ pub enum PageReconstructResult {
/// in-memory and on-disk layers.
///
pub trait Layer: Send + Sync {
fn get_tenant_id(&self) -> ZTenantId;
/// Identify the timeline this relish belongs to
fn get_timeline_id(&self) -> ZTimelineId;
/// Identify the relish segment
fn get_seg_tag(&self) -> SegmentTag;
/// Inclusive start bound of the LSN range that this layer hold
/// Inclusive start bound of the LSN range that this layer holds
fn get_start_lsn(&self) -> Lsn;
/// 'end_lsn' meaning depends on the layer kind:
/// - in-memory layer is either unbounded (end_lsn = MAX_LSN) or dropped (end_lsn = drop_lsn)
/// - image layer represents snapshot at one LSN, so end_lsn = lsn
/// - delta layer has end_lsn
/// Exclusive end bound of the LSN range that this layer holds.
///
/// TODO Is end_lsn always exclusive for all layer kinds?
/// - For an open in-memory layer, this is MAX_LSN.
/// - For a frozen in-memory layer or a delta layer, this is a valid end bound.
/// - An image layer represents a snapshot at one LSN, so end_lsn is always the snapshot LSN + 1
fn get_end_lsn(&self) -> Lsn;
/// Is the segment represented by this layer dropped by PostgreSQL?
@@ -140,15 +129,19 @@ pub trait Layer: Send + Sync {
/// of the *relish*, not the beginning of the segment. The requested
/// 'blknum' must be covered by this segment.
///
/// `cached_img_lsn` should be set to a cached page image's lsn < `lsn`.
/// This function will only return data after `cached_img_lsn`.
///
/// See PageReconstructResult for possible return values. The collected data
/// is appended to reconstruct_data; the caller should pass an empty struct
/// on first call. If this returns PageReconstructResult::Continue, call
/// again on the returned predecessor layer with the same 'reconstruct_data'
/// on first call. If this returns PageReconstructResult::Continue, look up
/// the predecessor layer and call again with the same 'reconstruct_data'
/// to collect more data.
fn get_page_reconstruct_data(
&self,
blknum: u32,
lsn: Lsn,
cached_img_lsn: Option<Lsn>,
reconstruct_data: &mut PageReconstructData,
) -> Result<PageReconstructResult>;
@@ -164,6 +157,9 @@ pub trait Layer: Send + Sync {
/// the previous non-incremental layer.
fn is_incremental(&self) -> bool;
/// Returns true for layers that are represented in memory.
fn is_in_memory(&self) -> bool;
/// Release memory used by this layer. There is no corresponding 'load'
/// function, that's done implicitly when you call one of the get-functions.
fn unload(&self) -> Result<()>;

View File

@@ -1,6 +1,8 @@
use layered_repository::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::PathBuf;
use std::time::Duration;
@@ -11,16 +13,46 @@ pub mod basebackup;
pub mod branches;
pub mod http;
pub mod layered_repository;
pub mod logger;
pub mod page_cache;
pub mod page_service;
pub mod relish;
pub mod remote_storage;
pub mod repository;
pub mod restore_local_repo;
pub mod tenant_mgr;
pub mod tenant_threads;
pub mod virtual_file;
pub mod waldecoder;
pub mod walreceiver;
pub mod walredo;
pub mod defaults {
use const_format::formatcp;
use std::time::Duration;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}");
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
pub const DEFAULT_CHECKPOINT_PERIOD: Duration = Duration::from_secs(10);
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: Duration = Duration::from_secs(10);
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 100;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
pub const DEFAULT_OPEN_MEM_LIMIT: usize = 128 * 1024 * 1024;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
}
lazy_static! {
static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!(
"pageserver_live_connections_count",
@@ -30,15 +62,27 @@ lazy_static! {
.expect("failed to define a metric");
}
pub const LOG_FILE_NAME: &str = "pageserver.log";
#[derive(Debug, Clone)]
pub struct PageServerConf {
pub daemonize: bool,
pub listen_addr: String,
pub http_endpoint_addr: String,
pub listen_pg_addr: String,
pub listen_http_addr: String,
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
pub checkpoint_distance: u64,
pub checkpoint_period: Duration,
pub gc_horizon: u64,
pub gc_period: Duration,
pub superuser: String,
pub open_mem_limit: usize,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
// Repository directory, relative to current working directory.
// Normally, the page server changes the current working directory
// to the repository, and 'workdir' is always '.'. But we don't do
@@ -52,6 +96,7 @@ pub struct PageServerConf {
pub auth_type: AuthType,
pub auth_validation_public_key_path: Option<PathBuf>,
pub remote_storage_config: Option<RemoteStorageConfig>,
}
impl PageServerConf {
@@ -60,7 +105,7 @@ impl PageServerConf {
//
fn tenants_path(&self) -> PathBuf {
self.workdir.join("tenants")
self.workdir.join(TENANTS_SEGMENT_NAME)
}
fn tenant_path(&self, tenantid: &ZTenantId) -> PathBuf {
@@ -84,21 +129,13 @@ impl PageServerConf {
}
fn timelines_path(&self, tenantid: &ZTenantId) -> PathBuf {
self.tenant_path(tenantid).join("timelines")
self.tenant_path(tenantid).join(TIMELINES_SEGMENT_NAME)
}
fn timeline_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timelines_path(tenantid).join(timelineid.to_string())
}
fn ancestor_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timeline_path(timelineid, tenantid).join("ancestor")
}
fn wal_dir_path(&self, timelineid: &ZTimelineId, tenantid: &ZTenantId) -> PathBuf {
self.timeline_path(timelineid, tenantid).join("wal")
}
//
// Postgres distribution paths
//
@@ -110,4 +147,87 @@ impl PageServerConf {
pub fn pg_lib_dir(&self) -> PathBuf {
self.pg_distrib_dir.join("lib")
}
#[cfg(test)]
fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{}", test_name))
}
#[cfg(test)]
fn dummy_conf(repo_dir: PathBuf) -> Self {
PageServerConf {
daemonize: false,
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
checkpoint_period: Duration::from_secs(10),
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
open_mem_limit: defaults::DEFAULT_OPEN_MEM_LIMIT,
page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE,
max_file_descriptors: defaults::DEFAULT_MAX_FILE_DESCRIPTORS,
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
superuser: "zenith_admin".to_string(),
workdir: repo_dir,
pg_distrib_dir: "".into(),
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
}
}
}
/// Config for the Repository checkpointer
#[derive(Debug, Clone, Copy)]
pub enum CheckpointConfig {
// Flush in-memory data that is older than this
Distance(u64),
// Flush all in-memory data
Forced,
}
/// External backup storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between pageserver and the remote storage.
pub max_concurrent_sync: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
pub storage: RemoteStorageKind,
}
/// A kind of a remote storage to connect to, with its connection configuration.
#[derive(Debug, Clone)]
pub enum RemoteStorageKind {
/// Storage based on local file system.
/// Specify a root folder to place all stored relish data into.
LocalFs(PathBuf),
/// AWS S3 based storage, storing all relishes into the root
/// of the S3 bucket from the config.
AwsS3(S3Config),
}
/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write).
#[derive(Clone)]
pub struct S3Config {
/// Name of the bucket to connect to.
pub bucket_name: String,
/// The region where the bucket is located at.
pub bucket_region: String,
/// "Login" to use when connecting to bucket.
/// Can be empty for cases like AWS k8s IAM
/// where we can allow certain pods to connect
/// to the bucket directly without any credentials.
pub access_key_id: Option<String>,
/// "Password" to use when connecting to bucket.
pub secret_access_key: Option<String>,
}
impl std::fmt::Debug for S3Config {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("S3Config")
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.finish()
}
}

View File

@@ -1,45 +0,0 @@
use crate::PageServerConf;
use anyhow::{Context, Result};
use slog::{Drain, FnValue};
use std::fs::{File, OpenOptions};
pub fn init_logging(
_conf: &PageServerConf,
log_filename: &str,
) -> Result<(slog_scope::GlobalLoggerGuard, File)> {
// Don't open the same file for output multiple times;
// the different fds could overwrite each other's output.
let log_file = OpenOptions::new()
.create(true)
.append(true)
.open(&log_filename)
.with_context(|| format!("failed to open {:?}", &log_filename))?;
let logger_file = log_file.try_clone().unwrap();
let decorator = slog_term::PlainSyncDecorator::new(logger_file);
let drain = slog_term::FullFormat::new(decorator).build();
let drain = slog::Filter::new(drain, |record: &slog::Record| {
if record.level().is_at_least(slog::Level::Info) {
return true;
}
false
});
let drain = std::sync::Mutex::new(drain).fuse();
let logger = slog::Logger::root(
drain,
slog::o!(
"location" =>
FnValue(move |record| {
format!("{}, {}:{}",
record.module(),
record.file(),
record.line()
)
}
)
),
);
Ok((slog_scope::set_global_logger(logger), log_file))
}

View File

@@ -0,0 +1,778 @@
//!
//! Global page cache
//!
//! The page cache uses up most of the memory in the page server. It is shared
//! by all tenants, and it is used to store different kinds of pages. Sharing
//! the cache allows memory to be dynamically allocated where it's needed the
//! most.
//!
//! The page cache consists of fixed-size buffers, 8 kB each to match the
//! PostgreSQL buffer size, and a Slot struct for each buffer to contain
//! information about what's stored in the buffer.
//!
//! # Locking
//!
//! There are two levels of locking involved: There's one lock for the "mapping"
//! from page identifier (tenant ID, timeline ID, rel, block, LSN) to the buffer
//! slot, and a separate lock on each slot. To read or write the contents of a
//! slot, you must hold the lock on the slot in read or write mode,
//! respectively. To change the mapping of a slot, i.e. to evict a page or to
//! assign a buffer for a page, you must hold the mapping lock and the lock on
//! the slot at the same time.
//!
//! Whenever you need to hold both locks simultaneously, the slot lock must be
//! acquired first. This consistent ordering avoids deadlocks. To look up a page
//! in the cache, you would first look up the mapping, while holding the mapping
//! lock, and then lock the slot. You must release the mapping lock in between,
//! to obey the lock ordering and avoid deadlock.
//!
//! A slot can momentarily have invalid contents, even if it's already been
//! inserted to the mapping, but you must hold the write-lock on the slot until
//! the contents are valid. If you need to release the lock without initializing
//! the contents, you must remove the mapping first. We make that easy for the
//! callers with PageWriteGuard: when lock_for_write() returns an uninitialized
//! page, the caller must explicitly call guard.mark_valid() after it has
//! initialized it. If the guard is dropped without calling mark_valid(), the
//! mapping is automatically removed and the slot is marked free.
//!
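//! A minimal usage sketch of that contract (illustrative only; `my_file_id`,
//! `blkno` and `read_block_from_file` are placeholders, not part of this module):
//!
//! ```ignore
//! let cache = page_cache::get();
//! match cache.read_ephemeral_buf(my_file_id, blkno) {
//!     ReadBufResult::Found(read_guard) => {
//!         // Contents are valid, use them directly.
//!     }
//!     ReadBufResult::NotFound(mut write_guard) => {
//!         // The slot is mapped, but its contents are not valid yet.
//!         read_block_from_file(&mut *write_guard);
//!         // Tell the cache the contents are now valid. If we dropped the
//!         // guard without this call, the mapping would be removed again.
//!         write_guard.mark_valid();
//!     }
//! }
//! ```
//!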
use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard,
},
};
use once_cell::sync::OnceCell;
use tracing::error;
use zenith_utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
use crate::layered_repository::writeback_ephemeral_file;
use crate::{relish::RelTag, PageServerConf};
static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
const TEST_PAGE_CACHE_SIZE: usize = 10;
///
/// Initialize the page cache. This must be called once at page server startup.
///
pub fn init(conf: &'static PageServerConf) {
if PAGE_CACHE
.set(PageCache::new(conf.page_cache_size))
.is_err()
{
panic!("page cache already initialized");
}
}
///
/// Get a handle to the page cache.
///
pub fn get() -> &'static PageCache {
//
// In unit tests, page server startup doesn't happen and no one calls
// page_cache::init(). Initialize it here with a tiny cache, so that the
// page cache is usable in unit tests.
//
if cfg!(test) {
PAGE_CACHE.get_or_init(|| PageCache::new(TEST_PAGE_CACHE_SIZE))
} else {
PAGE_CACHE.get().expect("page cache not initialized")
}
}
pub const PAGE_SZ: usize = postgres_ffi::pg_constants::BLCKSZ as usize;
const MAX_USAGE_COUNT: u8 = 5;
///
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
#[derive(Debug, PartialEq, Eq, Clone)]
enum CacheKey {
MaterializedPage {
hash_key: MaterializedPageHashKey,
lsn: Lsn,
},
EphemeralPage {
file_id: u64,
blkno: u32,
},
}
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
struct MaterializedPageHashKey {
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
}
#[derive(Clone)]
struct Version {
lsn: Lsn,
slot_idx: usize,
}
struct Slot {
inner: RwLock<SlotInner>,
usage_count: AtomicU8,
}
struct SlotInner {
key: Option<CacheKey>,
buf: &'static mut [u8; PAGE_SZ],
dirty: bool,
}
impl Slot {
/// Increment usage count on the buffer, with ceiling at MAX_USAGE_COUNT.
fn inc_usage_count(&self) {
let _ = self
.usage_count
.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |val| {
if val == MAX_USAGE_COUNT {
None
} else {
Some(val + 1)
}
});
}
/// Decrement usage count on the buffer, unless it's already zero. Returns
/// the old usage count.
fn dec_usage_count(&self) -> u8 {
let count_res =
self.usage_count
.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |val| {
if val == 0 {
None
} else {
Some(val - 1)
}
});
match count_res {
Ok(usage_count) => usage_count,
Err(usage_count) => usage_count,
}
}
}
pub struct PageCache {
/// This contains the mapping from the cache key to buffer slot that currently
/// contains the page, if any.
///
/// TODO: This is protected by a single lock. If that becomes a bottleneck,
/// this HashMap can be replaced with a more concurrent version, there are
/// plenty of such crates around.
///
/// If you add support for caching different kinds of objects, each object kind
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
ephemeral_page_map: RwLock<HashMap<(u64, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
/// Index of the next candidate to evict, for the Clock replacement algorithm.
/// This is interpreted modulo the page cache size.
next_evict_slot: AtomicUsize,
}
///
/// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
/// until the guard is dropped.
///
pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
impl std::ops::Deref for PageReadGuard<'_> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
self.0.buf
}
}
///
/// PageWriteGuard is a lease on a buffer for modifying it. The page is kept locked
/// until the guard is dropped.
///
/// Counterintuitively, this is used even for a read, if the requested page is not
/// currently found in the page cache. In that case, the caller of lock_for_read()
/// is expected to fill in the page contents and call mark_valid(). Similarly
/// lock_for_write() can return an invalid buffer that the caller is expected
/// to initialize.
///
pub struct PageWriteGuard<'i> {
inner: RwLockWriteGuard<'i, SlotInner>,
// Are the page contents currently valid?
valid: bool,
}
impl std::ops::DerefMut for PageWriteGuard<'_> {
fn deref_mut(&mut self) -> &mut Self::Target {
self.inner.buf
}
}
impl std::ops::Deref for PageWriteGuard<'_> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
self.inner.buf
}
}
impl PageWriteGuard<'_> {
/// Mark that the buffer contents are now valid.
pub fn mark_valid(&mut self) {
assert!(self.inner.key.is_some());
assert!(
!self.valid,
"mark_valid called on a buffer that was already valid"
);
self.valid = true;
}
pub fn mark_dirty(&mut self) {
// only ephemeral pages can be dirty ATM.
assert!(matches!(
self.inner.key,
Some(CacheKey::EphemeralPage { .. })
));
self.inner.dirty = true;
}
}
impl Drop for PageWriteGuard<'_> {
///
/// If the buffer was allocated for a page that was not already in the
/// cache, but the lock_for_read/write() caller dropped the buffer without
/// initializing it, remove the mapping from the page cache.
///
fn drop(&mut self) {
assert!(self.inner.key.is_some());
if !self.valid {
let self_key = self.inner.key.as_ref().unwrap();
PAGE_CACHE.get().unwrap().remove_mapping(self_key);
self.inner.key = None;
self.inner.dirty = false;
}
}
}
/// lock_for_read() return value
pub enum ReadBufResult<'a> {
Found(PageReadGuard<'a>),
NotFound(PageWriteGuard<'a>),
}
/// lock_for_write() return value
pub enum WriteBufResult<'a> {
Found(PageWriteGuard<'a>),
NotFound(PageWriteGuard<'a>),
}
impl PageCache {
//
// Section 1.1: Public interface functions for looking up and memorizing materialized page
// versions in the page cache
//
/// Look up a materialized page version.
///
/// The 'lsn' is an upper bound: this returns the latest version of the
/// given block that is not newer than 'lsn'. Also returns the actual LSN
/// of the returned page.
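///
/// Illustrative sketch only (identifiers are placeholders):
///
/// ```ignore
/// // Ask for the newest cached version of the block that is not newer than
/// // `request_lsn`; the returned LSN says which version was actually found.
/// if let Some((found_lsn, page)) =
///     cache.lookup_materialized_page(tenant_id, timeline_id, rel_tag, blknum, request_lsn)
/// {
///     assert!(found_lsn <= request_lsn);
///     // `page` dereferences to a [u8; PAGE_SZ] with the materialized image.
/// }
/// ```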
pub fn lookup_materialized_page(
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
lsn: Lsn,
) -> Option<(Lsn, PageReadGuard)> {
let mut cache_key = CacheKey::MaterializedPage {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
},
lsn,
};
if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
if let CacheKey::MaterializedPage { hash_key: _, lsn } = cache_key {
Some((lsn, guard))
} else {
panic!("unexpected key type in slot");
}
} else {
None
}
}
///
/// Store an image of the given page in the cache.
///
pub fn memorize_materialized_page(
&self,
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
rel_tag: RelTag,
blknum: u32,
lsn: Lsn,
img: &[u8],
) {
let cache_key = CacheKey::MaterializedPage {
hash_key: MaterializedPageHashKey {
tenant_id,
timeline_id,
rel_tag,
blknum,
},
lsn,
};
match self.lock_for_write(&cache_key) {
WriteBufResult::Found(write_guard) => {
// We already had it in cache. Another thread must've put it there
// concurrently. Check that it had the same contents that we
// replayed.
assert!(*write_guard == img);
}
WriteBufResult::NotFound(mut write_guard) => {
write_guard.copy_from_slice(img);
write_guard.mark_valid();
}
}
}
// Section 1.2: Public interface functions for working with Ephemeral pages.
pub fn read_ephemeral_buf(&self, file_id: u64, blkno: u32) -> ReadBufResult {
let mut cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
pub fn write_ephemeral_buf(&self, file_id: u64, blkno: u32) -> WriteBufResult {
let cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_write(&cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_ephemeral(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::EphemeralPage { file_id, blkno: _ } if *file_id == drop_file_id => {
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
}
//
// Section 2: Internal interface functions for lookup/update.
//
// To add support for a new kind of "thing" to cache, you will need
// to add public interface routines above, and code to deal with the
// "mappings" after this section. But the routines in this section should
// not require changes.
/// Look up a page in the cache.
///
/// If the search criteria is not exact, *cache_key is updated with the
/// exact key of the returned page. (For materialized pages, that means
/// that the LSN in 'cache_key' is updated with the LSN of the returned page
/// version.)
///
/// If no page is found, returns None and *cache_key is left unmodified.
///
fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
let cache_key_orig = cache_key.clone();
if let Some(slot_idx) = self.search_mapping(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.read().unwrap();
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageReadGuard(inner));
} else {
// search_mapping might have modified the search key; restore it.
*cache_key = cache_key_orig;
}
}
None
}
/// Return a locked buffer for given block.
///
/// Like try_lock_for_read(), if the search criteria is not exact and the
/// page is already found in the cache, *cache_key is updated.
///
/// If the page is not found in the cache, this allocates a new buffer for
/// it. The caller may then initialize the buffer with the contents, and
/// call mark_valid().
///
/// Example usage:
///
/// ```ignore
/// let cache = page_cache::get();
///
/// match cache.lock_for_read(&key) {
/// ReadBufResult::Found(read_guard) => {
/// // The page was found in cache. Use it
/// },
/// ReadBufResult::NotFound(write_guard) => {
/// // The page was not found in cache. Read it from disk into the
/// // buffer.
/// //read_my_page_from_disk(write_guard);
///
/// // The buffer contents are now valid. Tell the page cache.
/// write_guard.mark_valid();
/// },
/// }
/// ```
///
fn lock_for_read(&self, cache_key: &mut CacheKey) -> ReadBufResult {
loop {
// First check if the key already exists in the cache.
if let Some(read_guard) = self.try_lock_for_read(cache_key) {
return ReadBufResult::Found(read_guard);
}
// Not found. Find a victim buffer
let (slot_idx, mut inner) = self.find_victim();
// Insert mapping for this. At this point, we may find that another
// thread did the same thing concurrently. In that case, we evicted
// our victim buffer unnecessarily. Put it into the free list and
// continue with the slot that the other thread chose.
if let Some(_existing_slot_idx) = self.try_insert_mapping(cache_key, slot_idx) {
// TODO: put to free list
// We now just loop back to start from beginning. This is not
// optimal, we'll perform the lookup in the mapping again, which
// is not really necessary because we already got
// 'existing_slot_idx'. But this shouldn't happen often enough
// to matter much.
continue;
}
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
return ReadBufResult::NotFound(PageWriteGuard {
inner,
valid: false,
});
}
}
/// Look up a page in the cache and lock it in write mode. If it's not
/// found, returns None.
///
/// When locking a page for writing, the search criteria is always "exact".
fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we already released the mapping
// lock, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.write().unwrap();
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageWriteGuard { inner, valid: true });
}
}
None
}
/// Return a write-locked buffer for given block.
///
/// Similar to lock_for_read(), but the returned buffer is write-locked and
/// may be modified by the caller even if it's already found in the cache.
fn lock_for_write(&self, cache_key: &CacheKey) -> WriteBufResult {
loop {
// First check if the key already exists in the cache.
if let Some(write_guard) = self.try_lock_for_write(cache_key) {
return WriteBufResult::Found(write_guard);
}
// Not found. Find a victim buffer
let (slot_idx, mut inner) = self.find_victim();
// Insert mapping for this. At this point, we may find that another
// thread did the same thing concurrently. In that case, we evicted
// our victim buffer unnecessarily. Put it into the free list and
// continue with the slot that the other thread chose.
if let Some(_existing_slot_idx) = self.try_insert_mapping(cache_key, slot_idx) {
// TODO: put to free list
// We now just loop back to start from beginning. This is not
// optimal, we'll perform the lookup in the mapping again, which
// is not really necessary because we already got
// 'existing_slot_idx'. But this shouldn't happen often enough
// to matter much.
continue;
}
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
return WriteBufResult::NotFound(PageWriteGuard {
inner,
valid: false,
});
}
}
//
// Section 3: Mapping functions
//
/// Search for a page in the cache using the given search key.
///
/// Returns the slot index, if any. If the search criteria is not exact,
/// *cache_key is updated with the actual key of the found page.
///
/// NOTE: We don't hold any lock on the mapping on return, so the slot might
/// get recycled for an unrelated page immediately after this function
/// returns. The caller is responsible for re-checking that the slot still
/// contains the page with the same key before using it.
///
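/// For instance (assumed values, purely illustrative): if the cache holds
/// versions of a block at LSNs 10, 20 and 30, a search with lsn = 25 returns
/// the slot of the version at LSN 20 and rewrites the key's LSN to 20, while
/// a search with lsn = 5 returns None.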
fn search_mapping(&self, cache_key: &mut CacheKey) -> Option<usize> {
match cache_key {
CacheKey::MaterializedPage { hash_key, lsn } => {
let map = self.materialized_page_map.read().unwrap();
let versions = map.get(hash_key)?;
let version_idx = match versions.binary_search_by_key(lsn, |v| v.lsn) {
Ok(version_idx) => version_idx,
Err(0) => return None,
Err(version_idx) => version_idx - 1,
};
let version = &versions[version_idx];
*lsn = version.lsn;
Some(version.slot_idx)
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
/// Search for a page in the cache using the given search key.
///
/// Like 'search_mapping', but performs an "exact" search. Used for
/// allocating a new buffer.
fn search_mapping_for_write(&self, key: &CacheKey) -> Option<usize> {
match key {
CacheKey::MaterializedPage { hash_key, lsn } => {
let map = self.materialized_page_map.read().unwrap();
let versions = map.get(hash_key)?;
if let Ok(version_idx) = versions.binary_search_by_key(lsn, |v| v.lsn) {
Some(versions[version_idx].slot_idx)
} else {
None
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
}
}
///
/// Remove mapping for given key.
///
fn remove_mapping(&self, old_key: &CacheKey) {
match old_key {
CacheKey::MaterializedPage {
hash_key: old_hash_key,
lsn: old_lsn,
} => {
let mut map = self.materialized_page_map.write().unwrap();
if let Entry::Occupied(mut old_entry) = map.entry(old_hash_key.clone()) {
let versions = old_entry.get_mut();
if let Ok(version_idx) = versions.binary_search_by_key(old_lsn, |v| v.lsn) {
versions.remove(version_idx);
if versions.is_empty() {
old_entry.remove_entry();
}
}
} else {
panic!("could not find old key in mapping")
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
}
}
}
///
/// Insert mapping for given key.
///
/// If a mapping already existed for the given key, returns the slot index
/// of the existing mapping and leaves it untouched.
fn try_insert_mapping(&self, new_key: &CacheKey, slot_idx: usize) -> Option<usize> {
match new_key {
CacheKey::MaterializedPage {
hash_key: new_key,
lsn: new_lsn,
} => {
let mut map = self.materialized_page_map.write().unwrap();
let versions = map.entry(new_key.clone()).or_default();
match versions.binary_search_by_key(new_lsn, |v| v.lsn) {
Ok(version_idx) => Some(versions[version_idx].slot_idx),
Err(version_idx) => {
versions.insert(
version_idx,
Version {
lsn: *new_lsn,
slot_idx,
},
);
None
}
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
Entry::Occupied(entry) => Some(*entry.get()),
Entry::Vacant(entry) => {
entry.insert(slot_idx);
None
}
}
}
}
}
//
// Section 4: Misc internal helpers
//
/// Find a slot to evict.
///
/// On return, the slot is empty and write-locked.
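///
/// As an illustration of the Clock sweep (numbers assumed): a slot whose
/// usage_count is 3 gets skipped, with the count decremented, on three
/// consecutive passes; it only becomes a victim once its count has reached
/// zero, unless the sweep has already gone around twice (the iter_limit
/// below), in which case it is evicted regardless.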
fn find_victim(&self) -> (usize, RwLockWriteGuard<SlotInner>) {
let iter_limit = self.slots.len() * 2;
let mut iters = 0;
loop {
let slot_idx = self.next_evict_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
let slot = &self.slots[slot_idx];
if slot.dec_usage_count() == 0 || iters >= iter_limit {
let mut inner = slot.inner.write().unwrap();
if let Some(old_key) = &inner.key {
if inner.dirty {
if let Err(err) = Self::writeback(old_key, inner.buf) {
// Writing the page to disk failed.
//
// FIXME: What should we do here? We could propagate the error to the
// caller, but victim buffer is generally unrelated to the original
// call. It can even belong to a different tenant. Currently, we
// report the error to the log and continue the clock sweep to find
// a different victim. But if the problem persists, the page cache
// could fill up with dirty pages that we cannot evict, and we will
// loop retrying the writebacks indefinitely.
error!("writeback of buffer {:?} failed: {}", old_key, err);
continue;
}
}
// remove mapping for old buffer
self.remove_mapping(old_key);
inner.dirty = false;
inner.key = None;
}
return (slot_idx, inner);
}
iters += 1;
}
}
fn writeback(cache_key: &CacheKey, buf: &[u8]) -> Result<(), std::io::Error> {
match cache_key {
CacheKey::MaterializedPage {
hash_key: _,
lsn: _,
} => {
panic!("unexpected dirty materialized page");
}
CacheKey::EphemeralPage { file_id, blkno } => {
writeback_ephemeral_file(*file_id, *blkno, buf)
}
}
}
/// Initialize a new page cache
///
/// This should be called only once at page server startup.
fn new(num_pages: usize) -> Self {
assert!(num_pages > 0, "page cache size must be > 0");
let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());
let slots = page_buffer
.chunks_exact_mut(PAGE_SZ)
.map(|chunk| {
let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();
Slot {
inner: RwLock::new(SlotInner {
key: None,
buf,
dirty: false,
}),
usage_count: AtomicU8::new(0),
}
})
.collect();
Self {
materialized_page_map: Default::default(),
ephemeral_page_map: Default::default(),
slots,
next_evict_slot: AtomicUsize::new(0),
}
}
}

View File

@@ -13,7 +13,6 @@
use anyhow::{anyhow, bail, ensure, Context, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use lazy_static::lazy_static;
use log::*;
use regex::Regex;
use std::net::TcpListener;
use std::str;
@@ -21,10 +20,12 @@ use std::str::FromStr;
use std::sync::Arc;
use std::thread;
use std::{io, net::TcpStream};
use tracing::*;
use zenith_metrics::{register_histogram_vec, HistogramVec};
use zenith_utils::auth::{self, JwtAuth};
use zenith_utils::auth::{Claims, Scope};
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::is_socket_read_timed_out;
use zenith_utils::postgres_backend::PostgresBackend;
use zenith_utils::postgres_backend::{self, AuthType};
use zenith_utils::pq_proto::{
@@ -35,72 +36,110 @@ use zenith_utils::zid::{ZTenantId, ZTimelineId};
use crate::basebackup;
use crate::branches;
use crate::relish::*;
use crate::repository::Timeline;
use crate::tenant_mgr;
use crate::walreceiver;
use crate::PageServerConf;
// Wrapped in libpq CopyData
enum PagestreamFeMessage {
Exists(PagestreamRequest),
Nblocks(PagestreamRequest),
Read(PagestreamRequest),
Exists(PagestreamExistsRequest),
Nblocks(PagestreamNblocksRequest),
GetPage(PagestreamGetPageRequest),
}
// Wrapped in libpq CopyData
enum PagestreamBeMessage {
Status(PagestreamStatusResponse),
Nblocks(PagestreamStatusResponse),
Read(PagestreamReadResponse),
Exists(PagestreamExistsResponse),
Nblocks(PagestreamNblocksResponse),
GetPage(PagestreamGetPageResponse),
Error(PagestreamErrorResponse),
}
#[derive(Debug)]
struct PagestreamRequest {
spcnode: u32,
dbnode: u32,
relnode: u32,
forknum: u8,
blkno: u32,
struct PagestreamExistsRequest {
latest: bool,
lsn: Lsn,
rel: RelTag,
}
#[derive(Debug)]
struct PagestreamStatusResponse {
ok: bool,
struct PagestreamNblocksRequest {
latest: bool,
lsn: Lsn,
rel: RelTag,
}
#[derive(Debug)]
struct PagestreamGetPageRequest {
latest: bool,
lsn: Lsn,
rel: RelTag,
blkno: u32,
}
#[derive(Debug)]
struct PagestreamExistsResponse {
exists: bool,
}
#[derive(Debug)]
struct PagestreamNblocksResponse {
n_blocks: u32,
}
#[derive(Debug)]
struct PagestreamReadResponse {
ok: bool,
n_blocks: u32,
struct PagestreamGetPageResponse {
page: Bytes,
}
#[derive(Debug)]
struct PagestreamErrorResponse {
message: String,
}
impl PagestreamFeMessage {
fn parse(mut body: Bytes) -> anyhow::Result<PagestreamFeMessage> {
// TODO these gets can fail
let smgr_tag = body.get_u8();
let zreq = PagestreamRequest {
spcnode: body.get_u32(),
dbnode: body.get_u32(),
relnode: body.get_u32(),
forknum: body.get_u8(),
blkno: body.get_u32(),
lsn: Lsn::from(body.get_u64()),
};
// these correspond to the ZenithMessageTag enum in pagestore_client.h
//
// TODO: consider using protobuf or serde bincode for less error prone
// serialization.
match smgr_tag {
0 => Ok(PagestreamFeMessage::Exists(zreq)),
1 => Ok(PagestreamFeMessage::Nblocks(zreq)),
2 => Ok(PagestreamFeMessage::Read(zreq)),
_ => Err(anyhow!(
"unknown smgr message tag: {},'{:?}'",
smgr_tag,
body
)),
let msg_tag = body.get_u8();
match msg_tag {
0 => Ok(PagestreamFeMessage::Exists(PagestreamExistsRequest {
latest: body.get_u8() != 0,
lsn: Lsn::from(body.get_u64()),
rel: RelTag {
spcnode: body.get_u32(),
dbnode: body.get_u32(),
relnode: body.get_u32(),
forknum: body.get_u8(),
},
})),
1 => Ok(PagestreamFeMessage::Nblocks(PagestreamNblocksRequest {
latest: body.get_u8() != 0,
lsn: Lsn::from(body.get_u64()),
rel: RelTag {
spcnode: body.get_u32(),
dbnode: body.get_u32(),
relnode: body.get_u32(),
forknum: body.get_u8(),
},
})),
2 => Ok(PagestreamFeMessage::GetPage(PagestreamGetPageRequest {
latest: body.get_u8() != 0,
lsn: Lsn::from(body.get_u64()),
rel: RelTag {
spcnode: body.get_u32(),
dbnode: body.get_u32(),
relnode: body.get_u32(),
forknum: body.get_u8(),
},
blkno: body.get_u32(),
})),
_ => bail!("unknown smgr message tag: {},'{:?}'", msg_tag, body),
}
}
}
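// For reference, a client-side encoding of a GetPage request mirrors the
// parsing above (illustrative sketch only, nothing here is used by the server):
//
//   let mut buf = BytesMut::new();
//   buf.put_u8(2);            // message tag: GetPage
//   buf.put_u8(latest as u8); // "latest page version" flag
//   buf.put_u64(lsn.0);       // request LSN
//   buf.put_u32(rel.spcnode);
//   buf.put_u32(rel.dbnode);
//   buf.put_u32(rel.relnode);
//   buf.put_u8(rel.forknum);
//   buf.put_u32(blkno);       // block number comes last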
@@ -110,24 +149,26 @@ impl PagestreamBeMessage {
let mut bytes = BytesMut::new();
match self {
Self::Status(resp) => {
Self::Exists(resp) => {
bytes.put_u8(100); /* tag from pagestore_client.h */
bytes.put_u8(resp.ok as u8);
bytes.put_u32(resp.n_blocks);
bytes.put_u8(resp.exists as u8);
}
Self::Nblocks(resp) => {
bytes.put_u8(101); /* tag from pagestore_client.h */
bytes.put_u8(resp.ok as u8);
bytes.put_u32(resp.n_blocks);
}
Self::Read(resp) => {
Self::GetPage(resp) => {
bytes.put_u8(102); /* tag from pagestore_client.h */
bytes.put_u8(resp.ok as u8);
bytes.put_u32(resp.n_blocks);
bytes.put(&resp.page[..]);
}
Self::Error(resp) => {
bytes.put_u8(103); /* tag from pagestore_client.h */
bytes.put(resp.message.as_bytes());
bytes.put_u8(0); // null terminator
}
}
bytes.into()
@@ -147,17 +188,32 @@ pub fn thread_main(
listener: TcpListener,
auth_type: AuthType,
) -> anyhow::Result<()> {
loop {
let mut join_handles = Vec::new();
while !tenant_mgr::shutdown_requested() {
let (socket, peer_addr) = listener.accept()?;
debug!("accepted connection from {}", peer_addr);
socket.set_nodelay(true).unwrap();
let local_auth = auth.clone();
thread::spawn(move || {
if let Err(err) = page_service_conn_main(conf, local_auth, socket, auth_type) {
error!("error: {}", err);
}
});
let handle = thread::Builder::new()
.name("serving Page Service thread".into())
.spawn(move || {
if let Err(err) = page_service_conn_main(conf, local_auth, socket, auth_type) {
error!(%err, "page server thread exited with error");
}
})
.unwrap();
join_handles.push(handle);
}
debug!("page_service loop terminated. wait for connections to cancel");
for handle in join_handles.into_iter() {
handle.join().unwrap();
}
Ok(())
}
fn page_service_conn_main(
@@ -176,7 +232,7 @@ fn page_service_conn_main(
}
let mut conn_handler = PageServerHandler::new(conf, auth);
let pgbackend = PostgresBackend::new(socket, auth_type, None)?;
let pgbackend = PostgresBackend::new(socket, auth_type, None, true)?;
pgbackend.run(&mut conn_handler)
}
@@ -214,122 +270,180 @@ impl PageServerHandler {
}
}
fn handle_controlfile(&self, pgb: &mut PostgresBackend) -> io::Result<()> {
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::ControlFile)?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
Ok(())
}
fn handle_pagerequests(
&self,
pgb: &mut PostgresBackend,
timelineid: ZTimelineId,
tenantid: ZTenantId,
) -> anyhow::Result<()> {
let _enter = info_span!("pagestream", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
let repository = tenant_mgr::get_repository_for_tenant(&tenantid)?;
let timeline = repository
.get_timeline(timelineid)
.context(format!("error fetching timeline {}", timelineid))?;
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
/* switch client to COPYBOTH */
pgb.write_message(&BeMessage::CopyBothResponse)?;
while let Some(message) = pgb.read_message()? {
trace!("query({:?}): {:?}", timelineid, message);
while !tenant_mgr::shutdown_requested() {
match pgb.read_message() {
Ok(message) => {
if let Some(message) = message {
trace!("query: {:?}", message);
let copy_data_bytes = match message {
FeMessage::CopyData(bytes) => bytes,
_ => continue,
};
let copy_data_bytes = match message {
FeMessage::CopyData(bytes) => bytes,
_ => continue,
};
let zenith_fe_msg = PagestreamFeMessage::parse(copy_data_bytes)?;
let zenith_fe_msg = PagestreamFeMessage::parse(copy_data_bytes)?;
let response = match zenith_fe_msg {
PagestreamFeMessage::Exists(req) => {
let rel = RelTag {
spcnode: req.spcnode,
dbnode: req.dbnode,
relnode: req.relnode,
forknum: req.forknum,
};
let tag = RelishTag::Relation(rel);
let response = match zenith_fe_msg {
PagestreamFeMessage::Exists(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_exists"])
.observe_closure_duration(|| {
self.handle_get_rel_exists_request(&*timeline, &req)
}),
PagestreamFeMessage::Nblocks(req) => SMGR_QUERY_TIME
.with_label_values(&["get_rel_size"])
.observe_closure_duration(|| {
self.handle_get_nblocks_request(&*timeline, &req)
}),
PagestreamFeMessage::GetPage(req) => SMGR_QUERY_TIME
.with_label_values(&["get_page_at_lsn"])
.observe_closure_duration(|| {
self.handle_get_page_at_lsn_request(&*timeline, &req)
}),
};
let exist = SMGR_QUERY_TIME
.with_label_values(&["get_rel_exists"])
.observe_closure_duration(|| {
timeline.get_rel_exists(tag, req.lsn).unwrap_or(false)
let response = response.unwrap_or_else(|e| {
// print all the details to the log with {:#}, but for the client the
// error message is enough
error!("error reading relation or page version: {:#}", e);
PagestreamBeMessage::Error(PagestreamErrorResponse {
message: e.to_string(),
})
});
PagestreamBeMessage::Status(PagestreamStatusResponse {
ok: exist,
n_blocks: 0,
})
pgb.write_message(&BeMessage::CopyData(&response.serialize()))?;
} else {
break;
}
}
PagestreamFeMessage::Nblocks(req) => {
let rel = RelTag {
spcnode: req.spcnode,
dbnode: req.dbnode,
relnode: req.relnode,
forknum: req.forknum,
};
let tag = RelishTag::Relation(rel);
let n_blocks = SMGR_QUERY_TIME
.with_label_values(&["get_rel_size"])
.observe_closure_duration(|| {
// Return 0 if relation is not found.
// This is what postgres smgr expects.
timeline
.get_relish_size(tag, req.lsn)
.unwrap_or(Some(0))
.unwrap_or(0)
});
PagestreamBeMessage::Nblocks(PagestreamStatusResponse { ok: true, n_blocks })
Err(e) => {
if !is_socket_read_timed_out(&e) {
return Err(e);
}
}
PagestreamFeMessage::Read(req) => {
let rel = RelTag {
spcnode: req.spcnode,
dbnode: req.dbnode,
relnode: req.relnode,
forknum: req.forknum,
};
let tag = RelishTag::Relation(rel);
let read_response = SMGR_QUERY_TIME
.with_label_values(&["get_page_at_lsn"])
.observe_closure_duration(|| {
match timeline.get_page_at_lsn(tag, req.blkno, req.lsn) {
Ok(p) => PagestreamReadResponse {
ok: true,
n_blocks: 0,
page: p,
},
Err(e) => {
const ZERO_PAGE: [u8; 8192] = [0; 8192];
error!("get_page_at_lsn: {}", e);
PagestreamReadResponse {
ok: false,
n_blocks: 0,
page: Bytes::from_static(&ZERO_PAGE),
}
}
}
});
PagestreamBeMessage::Read(read_response)
}
};
pgb.write_message(&BeMessage::CopyData(&response.serialize()))?;
}
}
Ok(())
}
/// Helper function to handle the LSN from client request.
///
/// Each GetPage (and Exists and Nblocks) request includes information about
/// which version of the page is being requested. The client can request the
/// latest version of the page, or the version that's valid at a particular
/// LSN. The primary compute node will always request the latest page
/// version, while a standby will request a version at the LSN that it's
/// currently caught up to.
///
/// In either case, if the page server hasn't received the WAL up to the
/// requested LSN yet, we will wait for it to arrive. The return value is
/// the LSN that should be used to look up the page versions.
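///
/// For example (LSNs assumed): with latest = true, an LSN hint of 0/15 and a
/// last_record_lsn of 0/20, the request is served immediately at 0/20; with
/// latest = false and lsn = 0/25, we wait until WAL up to 0/25 has arrived
/// and then serve the request at exactly 0/25.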
fn wait_or_get_last_lsn(timeline: &dyn Timeline, lsn: Lsn, latest: bool) -> Result<Lsn> {
if latest {
// Latest page version was requested. If LSN is given, it is a hint
// to the page server that there have been no modifications to the
// page after that LSN. If we haven't received WAL up to that point,
// wait until it arrives.
let last_record_lsn = timeline.get_last_record_lsn();
// Note: this covers the special case that lsn == Lsn(0). That
// special case means "return the latest version whatever it is",
// and it's used for bootstrapping purposes, when the page server is
// connected directly to the compute node. That is needed because
// when you connect to the compute node, to receive the WAL, the
// walsender process will do a look up in the pg_authid catalog
// table for authentication. That poses a deadlock problem: the
// catalog table lookup will send a GetPage request, but the GetPage
// request will block in the page server because the recent WAL
// hasn't been received yet, and it cannot be received until the
// walsender completes the authentication and starts streaming the
// WAL.
if lsn <= last_record_lsn {
Ok(last_record_lsn)
} else {
timeline.wait_lsn(lsn)?;
// Since we waited for 'lsn' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the
// last-record LSN can advance immediately after we return
// anyway)
Ok(lsn)
}
} else {
if lsn == Lsn(0) {
bail!("invalid LSN(0) in request");
}
timeline.wait_lsn(lsn)?;
Ok(lsn)
}
}
fn handle_get_rel_exists_request(
&self,
timeline: &dyn Timeline,
req: &PagestreamExistsRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_rel_exists", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let exists = timeline.get_rel_exists(tag, lsn)?;
Ok(PagestreamBeMessage::Exists(PagestreamExistsResponse {
exists,
}))
}
fn handle_get_nblocks_request(
&self,
timeline: &dyn Timeline,
req: &PagestreamNblocksRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_nblocks", rel = %req.rel, req_lsn = %req.lsn).entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let n_blocks = timeline.get_relish_size(tag, lsn)?;
// Return 0 if relation is not found.
// This is what postgres smgr expects.
let n_blocks = n_blocks.unwrap_or(0);
Ok(PagestreamBeMessage::Nblocks(PagestreamNblocksResponse {
n_blocks,
}))
}
fn handle_get_page_at_lsn_request(
&self,
timeline: &dyn Timeline,
req: &PagestreamGetPageRequest,
) -> Result<PagestreamBeMessage> {
let _enter = info_span!("get_page", rel = %req.rel, blkno = &req.blkno, req_lsn = %req.lsn)
.entered();
let tag = RelishTag::Relation(req.rel);
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest)?;
let page = timeline.get_page_at_lsn(tag, req.blkno, lsn)?;
Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
page,
}))
}
fn handle_basebackup_request(
&self,
pgb: &mut PostgresBackend,
@@ -337,19 +451,25 @@ impl PageServerHandler {
lsn: Option<Lsn>,
tenantid: ZTenantId,
) -> anyhow::Result<()> {
let span = info_span!("basebackup", timeline = %timelineid, tenant = %tenantid, lsn = field::Empty);
let _enter = span.enter();
// check that the timeline exists
let repository = tenant_mgr::get_repository_for_tenant(&tenantid)?;
let timeline = repository
.get_timeline(timelineid)
.context(format!("error fetching timeline {}", timelineid))?;
/* switch client to COPYOUT */
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
if let Some(lsn) = lsn {
timeline
.check_lsn_is_in_scope(lsn)
.context("invalid basebackup lsn")?;
}
// switch client to COPYOUT
pgb.write_message(&BeMessage::CopyOutResponse)?;
info!("sent CopyOut");
/* Send a tarball of the latest layer on the timeline */
{
let mut writer = CopyDataSink { pgb };
let mut basebackup = basebackup::Basebackup::new(&mut writer, &timeline, lsn);
let mut basebackup = basebackup::Basebackup::new(&mut writer, &timeline, lsn)?;
span.record("lsn", &basebackup.lsn.to_string().as_str());
basebackup.send_tarball()?;
}
pgb.write_message(&BeMessage::CopyDone)?;
@@ -421,9 +541,7 @@ impl postgres_backend::Handler for PageServerHandler {
}
let query_string = std::str::from_utf8(&query_string)?;
if query_string.starts_with("controlfile") {
self.handle_controlfile(pgb)?;
} else if query_string.starts_with("pagestream ") {
if query_string.starts_with("pagestream ") {
let (_, params_raw) = query_string.split_at("pagestream ".len());
let params = params_raw.split(' ').collect::<Vec<_>>();
ensure!(
@@ -456,11 +574,6 @@ impl postgres_backend::Handler for PageServerHandler {
None
};
info!(
"got basebackup command. tenantid=\"{}\" timelineid=\"{}\" lsn=\"{:#?}\"",
tenantid, timelineid, lsn
);
// Check that the timeline exists
self.handle_basebackup_request(pgb, timelineid, lsn, tenantid)?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
@@ -478,11 +591,11 @@ impl postgres_backend::Handler for PageServerHandler {
self.check_permission(Some(tenantid))?;
let _enter =
info_span!("callmemaybe", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
let repository = tenant_mgr::get_repository_for_tenant(&tenantid)?;
repository
.get_timeline(timelineid)
.context(format!("error fetching timeline {}", timelineid))?;
tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
walreceiver::launch_wal_receiver(self.conf, timelineid, &connstr, tenantid.to_owned());
@@ -503,6 +616,9 @@ impl postgres_backend::Handler for PageServerHandler {
self.check_permission(Some(tenantid))?;
let _enter =
info_span!("branch_create", name = %branchname, tenant = %tenantid).entered();
let branch =
branches::create_branch(self.conf, &branchname, &startpoint_str, &tenantid)?;
let branch = serde_json::to_vec(&branch)?;
@@ -519,14 +635,16 @@ impl postgres_backend::Handler for PageServerHandler {
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let branches = crate::branches::get_branches(self.conf, &tenantid)?;
// since these handlers for tenant/branch commands are deprecated (in favor of http based ones)
// just pass false in place of "include non-incremental logical size"
let branches = crate::branches::get_branches(self.conf, &tenantid, false)?;
let branches_buf = serde_json::to_vec(&branches)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(&branches_buf)]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("tenant_list") {
let tenants = crate::branches::get_tenants(self.conf)?;
let tenants = crate::tenant_mgr::list_tenants()?;
let tenants_buf = serde_json::to_vec(&tenants)?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
@@ -577,21 +695,21 @@ impl postgres_backend::Handler for PageServerHandler {
.map(|h| h.as_str().parse())
.unwrap_or(Ok(self.conf.gc_horizon))?;
let repo = tenant_mgr::get_repository_for_tenant(&tenantid)?;
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"layer_relfiles_total"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_relfiles_not_updated"),
RowDescriptor::int8_col(b"layer_relfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_relfiles_removed"),
RowDescriptor::int8_col(b"layer_relfiles_dropped"),
RowDescriptor::int8_col(b"layer_nonrelfiles_total"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"),
RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"),
RowDescriptor::int8_col(b"layer_nonrelfiles_needed_as_tombstone"),
RowDescriptor::int8_col(b"layer_nonrelfiles_removed"),
RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"),
RowDescriptor::int8_col(b"elapsed"),
@@ -611,6 +729,12 @@ impl postgres_backend::Handler for PageServerHandler {
.as_bytes(),
),
Some(result.ondisk_relfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_relfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_relfiles_removed.to_string().as_bytes()),
Some(result.ondisk_relfiles_dropped.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_total.to_string().as_bytes()),
@@ -627,6 +751,12 @@ impl postgres_backend::Handler for PageServerHandler {
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_not_updated.to_string().as_bytes()),
Some(
result
.ondisk_nonrelfiles_needed_as_tombstone
.to_string()
.as_bytes(),
),
Some(result.ondisk_nonrelfiles_removed.to_string().as_bytes()),
Some(result.ondisk_nonrelfiles_dropped.to_string().as_bytes()),
Some(result.elapsed.as_millis().to_string().as_bytes()),

View File

@@ -0,0 +1,182 @@
//! A set of generic storage abstractions for the page server to use when backing up and restoring its state from the external storage.
//! This particular module serves as a public API border between pageserver and the internal storage machinery.
//! No other modules from this tree are supposed to be used directly by the external code.
//!
//! There are a few components the storage machinery consists of:
//! * the [`RemoteStorage`] trait, a CRUD-like generic abstraction used to adapt external storages, with a few implementations:
//! * [`local_fs`] allows using the local file system as an external storage
//! * [`rust_s3`] uses an AWS S3 bucket as an external storage
//!
//! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync.
//!
//! * a public API to interact with the external world: [`run_storage_sync_thread`] and [`schedule_timeline_checkpoint_upload`]
//!
//! Here's a schematic overview of all interactions between the backup machinery and the rest of the pageserver:
//!
//! +------------------------+ +--------->-------+
//! | | - - - (init async loop) - - - -> | |
//! | | | |
//! | | -------------------------------> | async |
//! | pageserver | (schedule checkpoint upload) | upload/download |
//! | | | loop |
//! | | <------------------------------- | |
//! | | (register downloaded layers) | |
//! +------------------------+ +---------<-------+
//! |
//! |
//! CRUD layer file operations |
//! (upload/download/delete/list, etc.) |
//! V
//! +------------------------+
//! | |
//! | [`RemoteStorage`] impl |
//! | |
//! | pageserver assumes it |
//! | owns exclusive write |
//! | access to this storage |
//! +------------------------+
//!
//! First, during startup, the pageserver initializes the storage sync thread with the async loop, or leaves the loop uninitialized if configured so.
//! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata.
//! If the storage sync loop was successfully started before, pageserver schedules the new image uploads after every checkpoint.
//! See [`crate::layered_repository`] for the upload calls and the adjacent logic.
//!
//! The storage logic considers an `image` as a set of local files, fully representing a certain timeline at a given moment (identified with `disk_consistent_lsn`).
//! A timeline can change its state by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed
//! by the storage upload, if enabled.
//! When a certain image gets uploaded, the sync loop remembers the fact, preventing further reuploads of the same image state.
//! No files are deleted from either local or remote storage; only the files missing locally/remotely get downloaded/uploaded, and the local metadata file is overwritten
//! when a newer timeline is downloaded.
//!
//! Meanwhile, the loop inits the storage connection and checks the remote files stored.
//! This is done once at startup only, relying on the fact that pageserver uses the storage alone (ergo, nobody else uploads the files to the storage but this server).
//! Based on the remote image data, the storage sync logic queues image downloads, while accepting any potential upload tasks from pageserver and managing the tasks by their priority.
//! On the image download, a [`crate::tenant_mgr::register_relish_download`] function is called to register the new image in pageserver, initializing all related threads and internal state.
//!
//! When the pageserver terminates, the upload loop finishes a current image sync task (if any) and exits.
//!
//! NOTES:
//! * pageserver assumes it has exclusive write access to the remote storage. Multiple pageservers may be separated within the same storage
//! (e.g. by using different directories in the local filesystem external storage), but that is entirely up to the storage implementation and not covered by the trait API.
//!
//! * the uploads do not happen right after pageserver startup, they are registered when
//! 1. pageserver does the checkpoint, which happens further in the future after the server start
//! 2. pageserver loads the timeline from disk for the first time
//!
//! * the uploads do not happen right after the upload registration: the sync loop might be occupied with other tasks, or tasks with bigger priority could be waiting already
//!
//! * all synchronization tasks (including the public API to register uploads and downloads, and the sync queue management) happen on an image scale: a big set of remote files,
//! enough to represent (and recover, if needed) a certain timeline state. In contrast, all internal storage CRUD calls are made per relish file from those images.
//! This way, the synchronization is able to download an image partially, if some state was synced before, but exposes only correctly synced images.
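//!
//! A minimal startup sketch, assuming `conf` is the leaked `&'static PageServerConf`
//! (illustrative only; error handling elided and `shutdown_handles` is a placeholder):
//!
//! ```ignore
//! // Returns Ok(None) when no remote storage is configured; otherwise the
//! // handle of the background sync thread is returned and can be joined on shutdown.
//! if let Some(sync_handle) = remote_storage::run_storage_sync_thread(conf)? {
//!     shutdown_handles.push(sync_handle);
//! }
//! ```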
mod local_fs;
mod rust_s3;
mod storage_sync;
use std::{
path::{Path, PathBuf},
thread,
};
use anyhow::Context;
use tokio::io;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
pub use self::storage_sync::schedule_timeline_checkpoint_upload;
use self::{local_fs::LocalFs, rust_s3::S3};
use crate::{PageServerConf, RemoteStorageKind};
/// Every timeline has its own id and belongs to a tenant;
/// the sync processes group timelines by both for simplicity.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Hash)]
pub struct TimelineSyncId(ZTenantId, ZTimelineId);
/// Based on the config, initiates the remote storage connection and starts a separate thread
/// that ensures that pageserver and the remote storage are in sync with each other.
/// If no external storage configuration is given, no thread or storage initialization is done.
pub fn run_storage_sync_thread(
config: &'static PageServerConf,
) -> anyhow::Result<Option<thread::JoinHandle<anyhow::Result<()>>>> {
match &config.remote_storage_config {
Some(storage_config) => {
let max_concurrent_sync = storage_config.max_concurrent_sync;
let max_sync_errors = storage_config.max_sync_errors;
let handle = match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => storage_sync::spawn_storage_sync_thread(
config,
LocalFs::new(root.clone(), &config.workdir)?,
max_concurrent_sync,
max_sync_errors,
),
RemoteStorageKind::AwsS3(s3_config) => storage_sync::spawn_storage_sync_thread(
config,
S3::new(s3_config, &config.workdir)?,
max_concurrent_sync,
max_sync_errors,
),
};
handle.map(Some)
}
None => Ok(None),
}
}
/// Storage (potentially remote) API to manage its state.
/// This storage tries to be unaware of any layered repository context,
/// providing basic CRUD operations with storage files.
#[async_trait::async_trait]
trait RemoteStorage: Send + Sync {
/// A way to uniquely reference a file in the remote storage.
type StoragePath;
/// Attempts to derive the storage path out of the local path, if the latter is correct.
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath>;
/// Gets the download path of the given storage file.
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf>;
/// Lists all items the storage has right now.
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>>;
/// Streams the local file contents into the remote storage entry.
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()>;
/// Streams the remote storage entry contents into the given buffered writer.
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
/// Streams a given byte range of the remote storage entry contents into the given buffered writer.
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()>;
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()>;
}
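// Illustrative only: a hedged sketch of how the sync code could drive a
// `RemoteStorage` implementation when uploading a single local file (no
// retries or error classification):
//
//   async fn upload_one<S: RemoteStorage>(storage: &S, local: &Path) -> anyhow::Result<()> {
//       // Map the workdir path to its remote counterpart, then stream the file up.
//       let destination = storage.storage_path(local)?;
//       let source = tokio::fs::File::open(local).await?;
//       storage.upload(source, &destination).await
//   }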
fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> {
if prefix == path {
anyhow::bail!(
"Prefix and the path are equal, cannot strip: '{}'",
prefix.display()
)
} else {
path.strip_prefix(prefix).with_context(|| {
format!(
"Path '{}' is not prefixed with '{}'",
path.display(),
prefix.display(),
)
})
}
}

View File

@@ -0,0 +1,75 @@
# Non-implementation details
This document describes the current state of the backup system in pageserver, its existing limitations and concerns, why some things are done the way they are, and the future development plans.
A detailed description of how the synchronization works and how it fits into the rest of the pageserver can be found in the [storage module](./../remote_storage.rs) and its submodules.
Ideally, this document should disappear after current implementation concerns are mitigated, with the remaining useful knowledge bits moved into rustdocs.
## Approach
Backup functionality is a new component that appeared well after the core DB functionality was implemented.
Pageserver layer functionality is also quite volatile at the moment, so there's a risk its local file management changes over time.
To avoid adding more chaos, backup functionality is currently designed as a relatively standalone component, with the majority of its logic placed in a separate async loop.
This way, backups are managed in the background without directly affecting other pageserver parts: the backup and restoration process may lag behind, but eventually catches up with reality. To track that, a set of prometheus metrics is exposed from pageserver.
## What's done
The current implementation
* provides remote storage wrappers for AWS S3 and local FS
* uploads layers, frozen by pageserver checkpoint thread
* downloads and registers layers, found on the remote storage, but missing locally
No serious optimisation or performance testing has been done yet; the feature is disabled by default and is being polished over time.
The plan is to resolve all currently open questions and prepare the feature to be enabled by default in cloud environments.
### Peculiarities
As mentioned, the backup component is rather new and under development currently, so not all things are done properly from the start.
Here's the list of known compromises with comments:
* Remote storage model is the same as the `tenants/` directory contents of the pageserver's local workdir storage.
This is relatively simple to implement, but may be costly to use in AWS S3: an initial data image contains ~782 relish files and a metadata file, ~31 MB combined.
AWS charges both per API call and for traffic, and layers are expected to be updated frequently, so this model is most probably cost-inefficient.
Additionally, pageservers might need to migrate images between tenants, which does not improve the situation.
The storage sync API operates on whole images when backing up or restoring, so we're free to repack the layer contents the way we want to, which will most probably be done later.
* no proper file comparison
Currently, every layer contains an `Lsn` in its name, to map the data it holds against a certain DB state.
Images with the same ids and different `Lsn`'s are then compared; files are considered equal if their local file paths are equal (for remote files, the "local file path" is their download destination).
No assertion on file contents is done currently, but it should be.
AWS S3 returns file checksums during the `list` operation, so that could be used to ensure backup consistency, but it needs further research, since the current pageserver impl would also need to deal with layer file checksums.
For now, due to this, we consider the local workdir files as the source of truth, never removing them and adjusting the remote files instead when image files mismatch.
* sad rust-s3 api
rust-s3 is not very pleasant to use:
1. it returns `anyhow::Result` and it's hard to distinguish, for instance, "missing file" cases from "no connection" ones
2. at least one function in its API that we need (`get_object_stream`) has the `async` keyword yet blocks (!), see details [here](https://github.com/zenithdb/zenith/pull/752#discussion_r728373091)
3. it's a prerelease library with unclear maintenance status
4. noisy on debug level
But it's already used in the project, so for now it's reused to avoid bloating the dependency tree.
Based on a previous evaluation, even `rusoto-s3` could be a better choice than this library, but that needs further benchmarking.
* gc and branches are ignored
So far, we don't consider non-main images and don't adjust the remote storage based on GC thread loop results.
Only the checkpointer loop affects the remote storage.
* more layers should be downloaded on demand
Since we download and load remote layers into pageserver, a need for those layers' ancestors may arise.
Most probably, a downloaded image's ancestors are not present locally either, but currently there's no logic for downloading such ancestors and their metadata,
so the pageserver is unable to respond properly to requests touching such ancestors.
To implement the downloading, more `tenant_mgr` refactoring is needed to properly handle web requests for layers and handle the state changes.
[Here](https://github.com/zenithdb/zenith/pull/689#issuecomment-931216193) are the details about initial state management updates needed.
* no integration tests
Automated S3 testing is currently lacking, since there is no convenient way to enable backups during the tests.
Once that is fixed, benchmark runs should also be carried out to find bottlenecks.

View File

@@ -0,0 +1,689 @@
//! Local filesystem acting as a remote storage.
//! Multiple pageservers can use the same "storage" of this kind by using different storage roots.
//!
//! This storage is used in pageserver tests, but can also be used in cases when a certain persistent
//! volume is mounted to the local FS.
use std::{
future::Future,
path::{Path, PathBuf},
pin::Pin,
};
use anyhow::{bail, ensure, Context};
use tokio::{
fs,
io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt},
};
use tracing::*;
use super::{strip_path_prefix, RemoteStorage};
pub struct LocalFs {
pageserver_workdir: &'static Path,
root: PathBuf,
}
impl LocalFs {
/// Attempts to create local FS storage, along with its root directory.
pub fn new(root: PathBuf, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
if !root.exists() {
std::fs::create_dir_all(&root).with_context(|| {
format!(
"Failed to create all directories in the given root path '{}'",
root.display(),
)
})?;
}
Ok(Self {
pageserver_workdir,
root,
})
}
fn resolve_in_storage(&self, path: &Path) -> anyhow::Result<PathBuf> {
if path.is_relative() {
Ok(self.root.join(path))
} else if path.starts_with(&self.root) {
Ok(path.to_path_buf())
} else {
bail!(
"Path '{}' does not belong to the current storage",
path.display()
)
}
}
}
#[async_trait::async_trait]
impl RemoteStorage for LocalFs {
type StoragePath = PathBuf;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
Ok(self.root.join(
strip_path_prefix(self.pageserver_workdir, local_path)
.context("local path does not belong to this storage")?,
))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
let relative_path = strip_path_prefix(&self.root, storage_path)
.context("local path does not belong to this storage")?;
Ok(self.pageserver_workdir.join(relative_path))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
Ok(get_all_files(&self.root).await?.into_iter().collect())
}
async fn upload(
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()> {
let target_file_path = self.resolve_in_storage(to)?;
create_target_directory(&target_file_path).await?;
let mut destination = io::BufWriter::new(
fs::OpenOptions::new()
.write(true)
.create(true)
.open(&target_file_path)
.await
.with_context(|| {
format!(
"Failed to open target fs destination at '{}'",
target_file_path.display()
)
})?,
);
io::copy(&mut from, &mut destination)
.await
.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
target_file_path.display()
)
})?;
destination.flush().await.with_context(|| {
format!(
"Failed to upload file to the local storage at '{}'",
target_file_path.display()
)
})?;
Ok(())
}
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
let mut source = io::BufReader::new(
fs::OpenOptions::new()
.read(true)
.open(&file_path)
.await
.with_context(|| {
format!(
"Failed to open source file '{}' to use in the download",
file_path.display()
)
})?,
);
io::copy(&mut source, to).await.with_context(|| {
format!(
"Failed to download file '{}' from the local storage",
file_path.display()
)
})?;
source.flush().await?;
Ok(())
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
if let Some(end_exclusive) = end_exclusive {
ensure!(
end_exclusive > start_inclusive,
"Invalid range, start ({}) is bigger then end ({:?})",
start_inclusive,
end_exclusive
);
if start_inclusive == end_exclusive.saturating_sub(1) {
return Ok(());
}
}
let file_path = self.resolve_in_storage(from)?;
if file_path.exists() && file_path.is_file() {
let mut source = io::BufReader::new(
fs::OpenOptions::new()
.read(true)
.open(&file_path)
.await
.with_context(|| {
format!(
"Failed to open source file '{}' to use in the download",
file_path.display()
)
})?,
);
source
.seek(io::SeekFrom::Start(start_inclusive))
.await
.context("Failed to seek to the range start in a local storage file")?;
match end_exclusive {
Some(end_exclusive) => {
io::copy(&mut source.take(end_exclusive - start_inclusive), to).await
}
None => io::copy(&mut source, to).await,
}
.with_context(|| {
format!(
"Failed to download file '{}' range from the local storage",
file_path.display()
)
})?;
Ok(())
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
let file_path = self.resolve_in_storage(path)?;
if file_path.exists() && file_path.is_file() {
Ok(fs::remove_file(file_path).await?)
} else {
bail!(
"File '{}' either does not exist or is not a file",
file_path.display()
)
}
}
}
fn get_all_files<'a, P>(
directory_path: P,
) -> Pin<Box<dyn Future<Output = anyhow::Result<Vec<PathBuf>>> + Send + Sync + 'a>>
where
P: AsRef<Path> + Send + Sync + 'a,
{
Box::pin(async move {
let directory_path = directory_path.as_ref();
if directory_path.exists() {
if directory_path.is_dir() {
let mut paths = Vec::new();
let mut dir_contents = fs::read_dir(directory_path).await?;
while let Some(dir_entry) = dir_contents.next_entry().await? {
let file_type = dir_entry.file_type().await?;
let entry_path = dir_entry.path();
if file_type.is_symlink() {
debug!("{:?} us a symlink, skipping", entry_path)
} else if file_type.is_dir() {
paths.extend(get_all_files(entry_path).await?.into_iter())
} else {
paths.push(dir_entry.path());
}
}
Ok(paths)
} else {
bail!("Path '{}' is not a directory", directory_path.display())
}
} else {
Ok(Vec::new())
}
})
}
async fn create_target_directory(target_file_path: &Path) -> anyhow::Result<()> {
let target_dir = match target_file_path.parent() {
Some(parent_dir) => parent_dir,
None => bail!(
"File path '{}' has no parent directory",
target_file_path.display()
),
};
if !target_dir.exists() {
fs::create_dir_all(target_dir).await?;
}
Ok(())
}
#[cfg(test)]
mod pure_tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
};
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("file_name");
let expected_path = storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?);
assert_eq!(
expected_path,
storage.storage_path(&local_path).expect("Matching path should map to storage path normally"),
"File paths from pageserver workdir should be stored in local fs storage with the same path they have relative to the workdir"
);
Ok(())
}
#[test]
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &LocalFs, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
Ok(wrong_path) => panic!(
"Expected path '{}' to error, but got storage path: {:?}",
mismatching_path.display(),
wrong_path,
),
Err(e) => format!("{:?}", e),
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let error_string = storage_path_error(&storage, &repo_harness.conf.workdir);
assert!(error_string.contains("does not belong to this storage"));
assert!(error_string.contains(repo_harness.conf.workdir.to_str().unwrap()));
let mismatching_path_str = "/something/else";
let error_message = storage_path_error(&storage, Path::new(mismatching_path_str));
assert!(
error_message.contains(mismatching_path_str),
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(error_message.contains("does not belong to this storage"));
Ok(())
}
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root.clone(),
};
let name = "not a metadata";
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join(name);
assert_eq!(
local_path,
storage
.local_path(
&storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?)
)
.expect("For a valid input, valid S3 info should be parsed"),
"Should be able to parse metadata out of the correctly named remote delta file"
);
let local_metadata_path = repo_harness
.timeline_path(&TIMELINE_ID)
.join(METADATA_FILE_NAME);
let remote_metadata_path = storage.storage_path(&local_metadata_path)?;
assert_eq!(
local_metadata_path,
storage
.local_path(&remote_metadata_path)
.expect("For a valid input, valid local path should be parsed"),
"Should be able to parse metadata out of the correctly named remote metadata file"
);
Ok(())
}
#[test]
fn local_path_negatives() -> anyhow::Result<()> {
#[track_caller]
#[allow(clippy::ptr_arg)] // have to use &PathBuf due to `storage.local_path` parameter requirements
fn local_path_error(storage: &LocalFs, storage_path: &PathBuf) -> String {
match storage.local_path(storage_path) {
Ok(wrong_path) => panic!(
"Expected local path input {:?} to cause an error, but got file path: {:?}",
storage_path, wrong_path,
),
Err(e) => format!("{:?}", e),
}
}
let repo_harness = RepoHarness::create("local_path_negatives")?;
let storage_root = PathBuf::from("somewhere").join("else");
let storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let totally_wrong_path = "wrong_wrong_wrong";
let error_message = local_path_error(&storage, &PathBuf::from(totally_wrong_path));
assert!(error_message.contains(totally_wrong_path));
Ok(())
}
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let storage_root = PathBuf::from("somewhere").join("else");
let dummy_storage = LocalFs {
pageserver_workdir: &repo_harness.conf.workdir,
root: storage_root,
};
let storage_path = dummy_storage.storage_path(&original_path)?;
let download_destination = dummy_storage.local_path(&storage_path)?;
assert_eq!(
original_path, download_destination,
"'original path -> storage path -> matching fs path' transformation should produce the same path as the input one for the correct path"
);
Ok(())
}
}
#[cfg(test)]
mod fs_tests {
use super::*;
use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID};
use std::io::Write;
use tempfile::tempdir;
#[tokio::test]
async fn upload_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("upload_file")?;
let storage = create_storage()?;
let source = create_file_for_upload(
&storage.pageserver_workdir.join("whatever"),
"whatever_contents",
)
.await?;
let target_path = PathBuf::from("/").join("somewhere").join("else");
match storage.upload(source, &target_path).await {
Ok(()) => panic!("Should not allow storing files with wrong target path"),
Err(e) => {
let message = format!("{:?}", e);
assert!(message.contains(&target_path.display().to_string()));
assert!(message.contains("does not belong to the current storage"));
}
}
assert!(storage.list().await?.is_empty());
let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1").await?;
assert_eq!(
storage.list().await?,
vec![target_path_1.clone()],
"Should list a single file after first upload"
);
let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2").await?;
assert_eq!(
list_files_sorted(&storage).await?,
vec![target_path_1.clone(), target_path_2.clone()],
"Should list a two different files after second upload"
);
Ok(())
}
fn create_storage() -> anyhow::Result<LocalFs> {
let pageserver_workdir = Box::leak(Box::new(tempdir()?.path().to_owned()));
let storage = LocalFs::new(tempdir()?.path().to_owned(), pageserver_workdir)?;
Ok(storage)
}
#[tokio::test]
async fn download_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage.download(&upload_target, &mut content_bytes).await?;
content_bytes.flush().await?;
let contents = String::from_utf8(content_bytes.into_inner().into_inner())?;
assert_eq!(
dummy_contents(upload_name),
contents,
"We should upload and download the same contents"
);
let non_existing_path = PathBuf::from("somewhere").join("else");
match storage.download(&non_existing_path, &mut io::sink()).await {
Ok(_) => panic!("Should not allow downloading non-existing storage files"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&non_existing_path.display().to_string()));
}
}
Ok(())
}
#[tokio::test]
async fn download_file_range_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_positive")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(&upload_target, 0, None, &mut full_range_bytes)
.await?;
full_range_bytes.flush().await?;
assert_eq!(
dummy_contents(upload_name),
String::from_utf8(full_range_bytes.into_inner().into_inner())?,
"Download full range should return the whole upload"
);
let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
let same_byte = 1_000_000_000;
storage
.download_range(
&upload_target,
same_byte,
Some(same_byte + 1), // exclusive end
&mut zero_range_bytes,
)
.await?;
zero_range_bytes.flush().await?;
assert!(
zero_range_bytes.into_inner().into_inner().is_empty(),
"Zero byte range should not download any part of the file"
);
let uploaded_bytes = dummy_contents(upload_name).into_bytes();
let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
&mut first_part_remote,
)
.await?;
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
assert_eq!(
first_part_local,
first_part_remote.as_slice(),
"First part bytes should be returned when requrested"
);
let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
storage
.download_range(
&upload_target,
first_part_local.len() as u64,
Some((first_part_local.len() + second_part_local.len()) as u64),
&mut second_part_remote,
)
.await?;
second_part_remote.flush().await?;
let second_part_remote = second_part_remote.into_inner().into_inner();
assert_eq!(
second_part_local,
second_part_remote.as_slice(),
"Second part bytes should be returned when requrested"
);
Ok(())
}
#[tokio::test]
async fn download_file_range_negative() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_file_range_negative")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
let start = 10000;
let end = 234;
assert!(start > end, "Should test an incorrect range");
match storage
.download_range(&upload_target, start, Some(end), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading wrong ranges"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("Invalid range"));
assert!(error_string.contains(&start.to_string()));
assert!(error_string.contains(&end.to_string()));
}
}
let non_existing_path = PathBuf::from("somewhere").join("else");
match storage
.download_range(&non_existing_path, 1, Some(3), &mut io::sink())
.await
{
Ok(_) => panic!("Should not allow downloading non-existing storage file ranges"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&non_existing_path.display().to_string()));
}
}
Ok(())
}
#[tokio::test]
async fn delete_file() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("delete_file")?;
let storage = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?;
storage.delete(&upload_target).await?;
assert!(storage.list().await?.is_empty());
match storage.delete(&upload_target).await {
Ok(()) => panic!("Should not allow deleting non-existing storage files"),
Err(e) => {
let error_string = e.to_string();
assert!(error_string.contains("does not exist"));
assert!(error_string.contains(&upload_target.display().to_string()));
}
}
Ok(())
}
async fn upload_dummy_file(
harness: &RepoHarness,
storage: &LocalFs,
name: &str,
) -> anyhow::Result<PathBuf> {
let timeline_path = harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?;
let storage_path = storage.root.join(relative_timeline_path).join(name);
storage
.upload(
create_file_for_upload(
&storage.pageserver_workdir.join(name),
&dummy_contents(name),
)
.await?,
&storage_path,
)
.await?;
Ok(storage_path)
}
async fn create_file_for_upload(
path: &Path,
contents: &str,
) -> anyhow::Result<io::BufReader<fs::File>> {
std::fs::create_dir_all(path.parent().unwrap())?;
let mut file_for_writing = std::fs::OpenOptions::new()
.write(true)
.create_new(true)
.open(path)?;
write!(file_for_writing, "{}", contents)?;
drop(file_for_writing);
Ok(io::BufReader::new(
fs::OpenOptions::new().read(true).open(&path).await?,
))
}
fn dummy_contents(name: &str) -> String {
format!("contents for {}", name)
}
async fn list_files_sorted(storage: &LocalFs) -> anyhow::Result<Vec<PathBuf>> {
let mut files = storage.list().await?;
files.sort();
Ok(files)
}
}


@@ -0,0 +1,373 @@
//! AWS S3 storage wrapper around `rust_s3` library.
//! Currently does not allow multiple pageservers to use the same bucket concurrently: objects are
//! placed in the root of the bucket.
use std::path::{Path, PathBuf};
use anyhow::Context;
use s3::{bucket::Bucket, creds::Credentials, region::Region};
use tokio::io::{self, AsyncWriteExt};
use crate::{
remote_storage::{strip_path_prefix, RemoteStorage},
S3Config,
};
const S3_FILE_SEPARATOR: char = '/';
#[derive(Debug, Eq, PartialEq)]
pub struct S3ObjectKey(String);
impl S3ObjectKey {
fn key(&self) -> &str {
&self.0
}
fn download_destination(&self, pageserver_workdir: &Path) -> PathBuf {
pageserver_workdir.join(self.0.split(S3_FILE_SEPARATOR).collect::<PathBuf>())
}
}
/// AWS S3 storage.
pub struct S3 {
pageserver_workdir: &'static Path,
bucket: Bucket,
}
impl S3 {
/// Creates the storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> {
let region = aws_config
.bucket_region
.parse::<Region>()
.context("Failed to parse the s3 region from config")?;
let credentials = Credentials::new(
aws_config.access_key_id.as_deref(),
aws_config.secret_access_key.as_deref(),
None,
None,
None,
)
.context("Failed to create the s3 credentials")?;
Ok(Self {
bucket: Bucket::new_with_path_style(
aws_config.bucket_name.as_str(),
region,
credentials,
)
.context("Failed to create the s3 bucket")?,
pageserver_workdir,
})
}
}
#[async_trait::async_trait]
impl RemoteStorage for S3 {
type StoragePath = S3ObjectKey;
fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> {
let relative_path = strip_path_prefix(self.pageserver_workdir, local_path)?;
let mut key = String::new();
for segment in relative_path {
key.push(S3_FILE_SEPARATOR);
key.push_str(&segment.to_string_lossy());
}
Ok(S3ObjectKey(key))
}
fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> {
Ok(storage_path.download_destination(self.pageserver_workdir))
}
async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> {
let list_response = self
.bucket
.list(String::new(), None)
.await
.context("Failed to list s3 objects")?;
Ok(list_response
.into_iter()
.flat_map(|response| response.contents)
.map(|s3_object| S3ObjectKey(s3_object.key))
.collect())
}
async fn upload(
&self,
mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
to: &Self::StoragePath,
) -> anyhow::Result<()> {
let mut upload_contents = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(&mut from, &mut upload_contents)
.await
.context("Failed to read the upload contents")?;
upload_contents
.flush()
.await
.context("Failed to read the upload contents")?;
let upload_contents = upload_contents.into_inner().into_inner();
let (_, code) = self
.bucket
.put_object(to.key(), &upload_contents)
.await
.with_context(|| format!("Failed to create s3 object with key {}", to.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during creating object with key '{}', code: {}",
to.key(),
code
))
} else {
Ok(())
}
}
async fn download(
&self,
from: &Self::StoragePath,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
let (data, code) = self
.bucket
.get_object(from.key())
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 200 {
Err(anyhow::format_err!(
"Received non-200 exit code during downloading object, code: {}",
code
))
} else {
// We don't have to write the vector into the destination this way, `to.write_all()` would be enough,
// but we want to prepare for a migration to `rusoto`, which has a streaming HTTP body here instead,
// for which it makes more sense to use `io::copy`.
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded data into the destination buffer")?;
Ok(())
}
}
async fn download_range(
&self,
from: &Self::StoragePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
to: &mut (impl io::AsyncWrite + Unpin + Send + Sync),
) -> anyhow::Result<()> {
// S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
// and needs both ends to be inclusive, hence the conversion of the exclusive end below
let end_inclusive = end_exclusive.map(|end| end.saturating_sub(1));
let (data, code) = self
.bucket
.get_object_range(from.key(), start_inclusive, end_inclusive)
.await
.with_context(|| format!("Failed to download s3 object with key {}", from.key()))?;
if code != 206 {
Err(anyhow::format_err!(
"Received non-206 exit code during downloading object range, code: {}",
code
))
} else {
// see `download` function above for the comment on why `Vec<u8>` buffer is copied this way
io::copy(&mut data.as_slice(), to)
.await
.context("Failed to write downloaded range into the destination buffer")?;
Ok(())
}
}
async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> {
let (_, code) = self
.bucket
.delete_object(path.key())
.await
.with_context(|| format!("Failed to delete s3 object with key {}", path.key()))?;
if code != 204 {
Err(anyhow::format_err!(
"Received non-204 exit code during deleting object with key '{}', code: {}",
path.key(),
code
))
} else {
Ok(())
}
}
}
#[cfg(test)]
mod tests {
use crate::{
layered_repository::metadata::METADATA_FILE_NAME,
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[test]
fn download_destination() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination")?;
let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("test_name");
let relative_path = local_path.strip_prefix(&repo_harness.conf.workdir)?;
let key = S3ObjectKey(format!(
"{}{}",
S3_FILE_SEPARATOR,
relative_path
.iter()
.map(|segment| segment.to_str().unwrap())
.collect::<Vec<_>>()
.join(&S3_FILE_SEPARATOR.to_string()),
));
assert_eq!(
local_path,
key.download_destination(&repo_harness.conf.workdir),
"Download destination should consist of s3 path joined with the pageserver workdir prefix"
);
Ok(())
}
#[test]
fn storage_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("storage_path_positive")?;
let segment_1 = "matching";
let segment_2 = "file";
let local_path = &repo_harness.conf.workdir.join(segment_1).join(segment_2);
let expected_key = S3ObjectKey(format!(
"{SEPARATOR}{}{SEPARATOR}{}",
segment_1,
segment_2,
SEPARATOR = S3_FILE_SEPARATOR,
));
let actual_key = dummy_storage(&repo_harness.conf.workdir)
.storage_path(local_path)
.expect("Matching path should map to S3 path normally");
assert_eq!(
expected_key,
actual_key,
"S3 key from the matching path should contain all segments after the workspace prefix, separated with S3 separator"
);
Ok(())
}
#[test]
fn storage_path_negatives() -> anyhow::Result<()> {
#[track_caller]
fn storage_path_error(storage: &S3, mismatching_path: &Path) -> String {
match storage.storage_path(mismatching_path) {
Ok(wrong_key) => panic!(
"Expected path '{}' to error, but got S3 key: {:?}",
mismatching_path.display(),
wrong_key,
),
Err(e) => e.to_string(),
}
}
let repo_harness = RepoHarness::create("storage_path_negatives")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let error_message = storage_path_error(&storage, &repo_harness.conf.workdir);
assert!(
error_message.contains("Prefix and the path are equal"),
"Message '{}' does not contain the required string",
error_message
);
let mismatching_path = PathBuf::from("somewhere").join("else");
let error_message = storage_path_error(&storage, &mismatching_path);
assert!(
error_message.contains(mismatching_path.to_str().unwrap()),
"Error should mention wrong path"
);
assert!(
error_message.contains(repo_harness.conf.workdir.to_str().unwrap()),
"Error should mention server workdir"
);
assert!(
error_message.contains("is not prefixed with"),
"Message '{}' does not contain a required string",
error_message
);
Ok(())
}
#[test]
fn local_path_positive() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("local_path_positive")?;
let storage = dummy_storage(&repo_harness.conf.workdir);
let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID);
let relative_timeline_path = timeline_dir.strip_prefix(&repo_harness.conf.workdir)?;
let s3_key = create_s3_key(&relative_timeline_path.join("not a metadata"));
assert_eq!(
s3_key.download_destination(&repo_harness.conf.workdir),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
"Should be able to parse metadata out of the correctly named remote delta file"
);
let s3_key = create_s3_key(&relative_timeline_path.join(METADATA_FILE_NAME));
assert_eq!(
s3_key.download_destination(&repo_harness.conf.workdir),
storage
.local_path(&s3_key)
.expect("For a valid input, valid S3 info should be parsed"),
"Should be able to parse metadata out of the correctly named remote metadata file"
);
Ok(())
}
#[test]
fn download_destination_matches_original_path() -> anyhow::Result<()> {
let repo_harness = RepoHarness::create("download_destination_matches_original_path")?;
let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name");
let dummy_storage = dummy_storage(&repo_harness.conf.workdir);
let key = dummy_storage.storage_path(&original_path)?;
let download_destination = dummy_storage.local_path(&key)?;
assert_eq!(
original_path, download_destination,
"'original path -> storage key -> matching fs path' transformation should produce the same path as the input one for the correct path"
);
Ok(())
}
fn dummy_storage(pageserver_workdir: &'static Path) -> S3 {
S3 {
pageserver_workdir,
bucket: Bucket::new(
"dummy-bucket",
"us-east-1".parse().unwrap(),
Credentials::anonymous().unwrap(),
)
.unwrap(),
}
}
fn create_s3_key(relative_file_path: &Path) -> S3ObjectKey {
S3ObjectKey(
relative_file_path
.iter()
.fold(String::new(), |mut path_string, segment| {
path_string.push(S3_FILE_SEPARATOR);
path_string.push_str(segment.to_str().unwrap());
path_string
}),
)
}
}

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -2,19 +2,17 @@
//! Import data and WAL from a PostgreSQL data directory and WAL segments into
//! zenith Timeline.
//!
use log::*;
use postgres_ffi::nonrelfile_utils::clogpage_precedes;
use postgres_ffi::nonrelfile_utils::slru_may_delete_clogsegment;
use std::cmp::min;
use std::fs;
use std::fs::File;
use std::io::Read;
use std::io::Seek;
use std::io::SeekFrom;
use std::io::{Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};
use anyhow::Result;
use bytes::{Buf, Bytes};
use anyhow::{anyhow, bail, Result};
use bytes::{Buf, Bytes, BytesMut};
use tracing::*;
use crate::relish::*;
use crate::repository::*;
@@ -36,9 +34,11 @@ static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]);
///
pub fn import_timeline_from_postgres_datadir(
path: &Path,
timeline: &dyn Timeline,
writer: &dyn TimelineWriter,
lsn: Lsn,
) -> Result<()> {
let mut pg_control: Option<ControlFileData> = None;
// Scan 'global'
for direntry in fs::read_dir(path.join("global"))? {
let direntry = direntry?;
@@ -46,16 +46,10 @@ pub fn import_timeline_from_postgres_datadir(
None => continue,
Some("pg_control") => {
import_nonrel_file(timeline, lsn, RelishTag::ControlFile, &direntry.path())?;
// Extract checkpoint record from pg_control and store is as separate object
let pg_control_bytes =
timeline.get_page_at_lsn_nowait(RelishTag::ControlFile, 0, lsn)?;
let pg_control = ControlFileData::decode(&pg_control_bytes)?;
let checkpoint_bytes = pg_control.checkPointCopy.encode();
timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, checkpoint_bytes)?;
pg_control = Some(import_control_file(writer, lsn, &direntry.path())?);
}
Some("pg_filenode.map") => import_nonrel_file(
timeline,
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::GLOBALTABLESPACE_OID,
@@ -67,7 +61,7 @@ pub fn import_timeline_from_postgres_datadir(
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
timeline,
writer,
lsn,
pg_constants::GLOBALTABLESPACE_OID,
0,
@@ -94,7 +88,7 @@ pub fn import_timeline_from_postgres_datadir(
Some("PG_VERSION") => continue,
Some("pg_filenode.map") => import_nonrel_file(
timeline,
writer,
lsn,
RelishTag::FileNodeMap {
spcnode: pg_constants::DEFAULTTABLESPACE_OID,
@@ -106,7 +100,7 @@ pub fn import_timeline_from_postgres_datadir(
// Load any relation files into the page server
_ => import_relfile(
&direntry.path(),
timeline,
writer,
lsn,
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
@@ -116,25 +110,36 @@ pub fn import_timeline_from_postgres_datadir(
}
for entry in fs::read_dir(path.join("pg_xact"))? {
let entry = entry?;
import_slru_file(timeline, lsn, SlruKind::Clog, &entry.path())?;
import_slru_file(writer, lsn, SlruKind::Clog, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("members"))? {
let entry = entry?;
import_slru_file(timeline, lsn, SlruKind::MultiXactMembers, &entry.path())?;
import_slru_file(writer, lsn, SlruKind::MultiXactMembers, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_multixact").join("offsets"))? {
let entry = entry?;
import_slru_file(timeline, lsn, SlruKind::MultiXactOffsets, &entry.path())?;
import_slru_file(writer, lsn, SlruKind::MultiXactOffsets, &entry.path())?;
}
for entry in fs::read_dir(path.join("pg_twophase"))? {
let entry = entry?;
let xid = u32::from_str_radix(entry.path().to_str().unwrap(), 16)?;
import_nonrel_file(timeline, lsn, RelishTag::TwoPhase { xid }, &entry.path())?;
import_nonrel_file(writer, lsn, RelishTag::TwoPhase { xid }, &entry.path())?;
}
// TODO: Scan pg_tblspc
timeline.advance_last_record_lsn(lsn.align());
timeline.checkpoint()?;
writer.advance_last_record_lsn(lsn);
// Import WAL. This is needed even when starting from a shutdown checkpoint, because
// this reads the checkpoint record itself, advancing the tip of the timeline to
// *after* the checkpoint record. And crucially, it initializes the 'prev_lsn'
let pg_control = pg_control.ok_or_else(|| anyhow!("pg_control file not found"))?;
import_wal(
&path.join("pg_wal"),
writer,
Lsn(pg_control.checkPointCopy.redo),
lsn,
&mut pg_control.checkPointCopy.clone(),
)?;
Ok(())
}
@@ -142,12 +147,13 @@ pub fn import_timeline_from_postgres_datadir(
// subroutine of import_timeline_from_postgres_datadir(), to load one relation file.
fn import_relfile(
path: &Path,
timeline: &dyn Timeline,
timeline: &dyn TimelineWriter,
lsn: Lsn,
spcoid: Oid,
dboid: Oid,
) -> Result<()> {
// Does it look like a relation file?
trace!("importing rel file {}", path.display());
let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap());
if let Err(e) = p {
@@ -175,15 +181,14 @@ fn import_relfile(
}
// TODO: UnexpectedEof is expected
Err(e) => match e.kind() {
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
break;
}
_ => {
error!("error reading file: {:?} ({})", path, e);
break;
bail!("error reading file {}: {:#}", path.display(), err);
}
},
};
@@ -200,7 +205,7 @@ fn import_relfile(
/// are just slurped into the repository as one blob.
///
fn import_nonrel_file(
timeline: &dyn Timeline,
timeline: &dyn TimelineWriter,
lsn: Lsn,
tag: RelishTag,
path: &Path,
@@ -210,22 +215,60 @@ fn import_nonrel_file(
// read the whole file
file.read_to_end(&mut buffer)?;
info!("importing non-rel file {}", path.display());
trace!("importing non-rel file {}", path.display());
timeline.put_page_image(tag, 0, lsn, Bytes::copy_from_slice(&buffer[..]))?;
Ok(())
}
///
/// Import pg_control file into the repository.
///
/// The control file is imported as is, but we also extract the checkpoint record
/// from it and store it separately.
fn import_control_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
path: &Path,
) -> Result<ControlFileData> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
// read the whole file
file.read_to_end(&mut buffer)?;
trace!("importing control file {}", path.display());
// Import it as ControlFile
timeline.put_page_image(
RelishTag::ControlFile,
0,
lsn,
Bytes::copy_from_slice(&buffer[..]),
)?;
// Extract the checkpoint record and import it separately.
let pg_control = ControlFileData::decode(&buffer)?;
let checkpoint_bytes = pg_control.checkPointCopy.encode();
timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, checkpoint_bytes)?;
Ok(pg_control)
}
///
/// Import an SLRU segment file
///
fn import_slru_file(timeline: &dyn Timeline, lsn: Lsn, slru: SlruKind, path: &Path) -> Result<()> {
fn import_slru_file(
timeline: &dyn TimelineWriter,
lsn: Lsn,
slru: SlruKind,
path: &Path,
) -> Result<()> {
// Does it look like an SLRU file?
let mut file = File::open(path)?;
let mut buf: [u8; 8192] = [0u8; 8192];
let segno = u32::from_str_radix(path.file_name().unwrap().to_str().unwrap(), 16)?;
info!("importing slru file {}", path.display());
trace!("importing slru file {}", path.display());
let mut rpageno = 0;
loop {
@@ -241,15 +284,14 @@ fn import_slru_file(timeline: &dyn Timeline, lsn: Lsn, slru: SlruKind, path: &Pa
}
// TODO: UnexpectedEof is expected
Err(e) => match e.kind() {
Err(err) => match err.kind() {
std::io::ErrorKind::UnexpectedEof => {
// reached EOF. That's expected.
// FIXME: maybe check that we read the full length of the file?
break;
}
_ => {
error!("error reading file: {:?} ({})", path, e);
break;
bail!("error reading file {}: {:#}", path.display(), err);
}
},
};
@@ -261,24 +303,27 @@ fn import_slru_file(timeline: &dyn Timeline, lsn: Lsn, slru: SlruKind, path: &Pa
Ok(())
}
/// Scan PostgreSQL WAL files in given directory
/// and load all records >= 'startpoint' into the repository.
pub fn import_timeline_wal(walpath: &Path, timeline: &dyn Timeline, startpoint: Lsn) -> Result<()> {
/// Scan PostgreSQL WAL files in given directory and load all records between
/// 'startpoint' and 'endpoint' into the repository.
fn import_wal(
walpath: &Path,
timeline: &dyn TimelineWriter,
startpoint: Lsn,
endpoint: Lsn,
checkpoint: &mut CheckPoint,
) -> Result<()> {
let mut waldecoder = WalStreamDecoder::new(startpoint);
let mut segno = startpoint.segment_number(pg_constants::WAL_SEGMENT_SIZE);
let mut offset = startpoint.segment_offset(pg_constants::WAL_SEGMENT_SIZE);
let mut last_lsn = startpoint;
let checkpoint_bytes = timeline.get_page_at_lsn_nowait(RelishTag::Checkpoint, 0, startpoint)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
loop {
while last_lsn <= endpoint {
// FIXME: assume postgresql tli 1 for now
let filename = XLogFileName(1, segno, pg_constants::WAL_SEGMENT_SIZE);
let mut buf = Vec::new();
//Read local file
// Read local file
let mut path = walpath.join(&filename);
// It could be as .partial
@@ -287,13 +332,7 @@ pub fn import_timeline_wal(walpath: &Path, timeline: &dyn Timeline, startpoint:
}
// Slurp the WAL file
let open_result = File::open(&path);
if let Err(e) = &open_result {
if e.kind() == std::io::ErrorKind::NotFound {
break;
}
}
let mut file = open_result?;
let mut file = File::open(&path)?;
if offset > 0 {
file.seek(SeekFrom::Start(offset as u64))?;
@@ -308,36 +347,57 @@ pub fn import_timeline_wal(walpath: &Path, timeline: &dyn Timeline, startpoint:
waldecoder.feed_bytes(&buf);
let mut nrecords = 0;
loop {
let rec = waldecoder.poll_decode();
if rec.is_err() {
// Assume that an error means we've reached the end of
// a partial WAL record. So that's ok.
trace!("WAL decoder error {:?}", rec);
break;
}
if let Some((lsn, recdata)) = rec.unwrap() {
while last_lsn <= endpoint {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let mut checkpoint_modified = false;
let decoded = decode_wal_record(recdata.clone());
save_decoded_record(&mut checkpoint, timeline, &decoded, recdata, lsn)?;
save_decoded_record(
checkpoint,
&mut checkpoint_modified,
timeline,
&decoded,
recdata,
lsn,
)?;
last_lsn = lsn;
} else {
break;
if checkpoint_modified {
let checkpoint_bytes = checkpoint.encode();
timeline.put_page_image(
RelishTag::Checkpoint,
0,
last_lsn,
checkpoint_bytes,
)?;
}
// Now that this record has been fully handled, including updating the
// checkpoint data, let the repository know that it is up-to-date to this LSN
timeline.advance_last_record_lsn(last_lsn);
nrecords += 1;
trace!("imported record at {} (end {})", lsn, endpoint);
}
nrecords += 1;
}
info!("imported {} records up to {}", nrecords, last_lsn);
debug!("imported {} records up to {}", nrecords, last_lsn);
segno += 1;
offset = 0;
}
info!("reached end of WAL at {}", last_lsn);
let checkpoint_bytes = checkpoint.encode();
timeline.put_page_image(RelishTag::Checkpoint, 0, last_lsn, checkpoint_bytes)?;
if last_lsn != startpoint {
debug!(
"reached end of WAL at {}, updating checkpoint info",
last_lsn
);
timeline.advance_last_record_lsn(last_lsn);
} else {
info!("no WAL to import at {}", last_lsn);
}
timeline.advance_last_record_lsn(last_lsn.align());
timeline.checkpoint()?;
Ok(())
}
@@ -347,13 +407,15 @@ pub fn import_timeline_wal(walpath: &Path, timeline: &dyn Timeline, startpoint:
///
pub fn save_decoded_record(
checkpoint: &mut CheckPoint,
timeline: &dyn Timeline,
checkpoint_modified: &mut bool,
timeline: &dyn TimelineWriter,
decoded: &DecodedWALRecord,
recdata: Bytes,
lsn: Lsn,
) -> Result<()> {
checkpoint.update_next_xid(decoded.xl_xid);
if checkpoint.update_next_xid(decoded.xl_xid) {
*checkpoint_modified = true;
}
// Iterate through all the blocks that the record modifies, and
// "put" a separate copy of the record for each block.
for blk in decoded.blocks.iter() {
@@ -364,14 +426,43 @@ pub fn save_decoded_record(
forknum: blk.forknum as u8,
});
let rec = WALRecord {
lsn,
will_init: blk.will_init || blk.apply_image,
rec: recdata.clone(),
main_data_offset: decoded.main_data_offset as u32,
};
//
// Instead of storing the full-page-image WAL record,
// it is better to store the extracted image: we can skip wal-redo
// in this case. Also, some FPI records may contain multiple (up to 32) pages,
// so they would have to be copied multiple times.
//
if blk.apply_image
&& blk.has_image
&& decoded.xl_rmid == pg_constants::RM_XLOG_ID
&& (decoded.xl_info == pg_constants::XLOG_FPI
|| decoded.xl_info == pg_constants::XLOG_FPI_FOR_HINT)
// compression of WAL is not yet supported: fall back to storing the original WAL record
&& (blk.bimg_info & pg_constants::BKPIMAGE_IS_COMPRESSED) == 0
{
// Extract page image from FPI record
let img_len = blk.bimg_len as usize;
let img_offs = blk.bimg_offset as usize;
let mut image = BytesMut::with_capacity(pg_constants::BLCKSZ as usize);
image.extend_from_slice(&recdata[img_offs..img_offs + img_len]);
timeline.put_wal_record(tag, blk.blkno, rec)?;
if blk.hole_length != 0 {
let tail = image.split_off(blk.hole_offset as usize);
image.resize(image.len() + blk.hole_length as usize, 0u8);
image.unsplit(tail);
}
image[0..4].copy_from_slice(&((lsn.0 >> 32) as u32).to_le_bytes());
image[4..8].copy_from_slice(&(lsn.0 as u32).to_le_bytes());
assert_eq!(image.len(), pg_constants::BLCKSZ as usize);
timeline.put_page_image(tag, blk.blkno, lsn, image.freeze())?;
} else {
let rec = WALRecord {
will_init: blk.will_init || blk.apply_image,
rec: recdata.clone(),
main_data_offset: decoded.main_data_offset as u32,
};
timeline.put_wal_record(lsn, tag, blk.blkno, rec)?;
}
}
let mut buf = decoded.record.clone();
@@ -392,10 +483,14 @@ pub fn save_decoded_record(
{
let dropdb = XlDropDatabase::decode(&mut buf);
// To drop the database, we need to drop all the relations in it. Like in
// save_xlog_dbase_create(), use the previous record's LSN in the list_rels() call
let req_lsn = min(timeline.get_last_record_lsn(), lsn);
for tablespace_id in dropdb.tablespace_ids {
let rels = timeline.list_rels(tablespace_id, dropdb.db_id, lsn)?;
let rels = timeline.list_rels(tablespace_id, dropdb.db_id, req_lsn)?;
for rel in rels {
timeline.drop_relish(RelishTag::Relation(rel), lsn)?;
timeline.drop_relish(rel, lsn)?;
}
trace!(
"Drop FileNodeMap {}, {} at lsn {}",
@@ -432,7 +527,7 @@ pub fn save_decoded_record(
} else {
assert!(info == pg_constants::CLOG_TRUNCATE);
let xlrec = XlClogTruncate::decode(&mut buf);
save_clog_truncate_record(checkpoint, timeline, lsn, &xlrec)?;
save_clog_truncate_record(checkpoint, checkpoint_modified, timeline, lsn, &xlrec)?;
}
} else if decoded.xl_rmid == pg_constants::RM_XACT_ID {
let info = decoded.xl_info & pg_constants::XLOG_XACT_OPMASK;
@@ -501,10 +596,17 @@ pub fn save_decoded_record(
)?;
} else if info == pg_constants::XLOG_MULTIXACT_CREATE_ID {
let xlrec = XlMultiXactCreate::decode(&mut buf);
save_multixact_create_record(checkpoint, timeline, lsn, &xlrec, decoded)?;
save_multixact_create_record(
checkpoint,
checkpoint_modified,
timeline,
lsn,
&xlrec,
decoded,
)?;
} else if info == pg_constants::XLOG_MULTIXACT_TRUNCATE_ID {
let xlrec = XlMultiXactTruncate::decode(&mut buf);
save_multixact_truncate_record(checkpoint, timeline, lsn, &xlrec)?;
save_multixact_truncate_record(checkpoint, checkpoint_modified, timeline, lsn, &xlrec)?;
}
} else if decoded.xl_rmid == pg_constants::RM_RELMAP_ID {
let xlrec = XlRelmapUpdate::decode(&mut buf);
@@ -513,7 +615,10 @@ pub fn save_decoded_record(
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_NEXTOID {
let next_oid = buf.get_u32_le();
checkpoint.nextOid = next_oid;
if checkpoint.nextOid != next_oid {
checkpoint.nextOid = next_oid;
*checkpoint_modified = true;
}
} else if info == pg_constants::XLOG_CHECKPOINT_ONLINE
|| info == pg_constants::XLOG_CHECKPOINT_SHUTDOWN
{
@@ -529,17 +634,19 @@ pub fn save_decoded_record(
);
if (checkpoint.oldestXid.wrapping_sub(xlog_checkpoint.oldestXid) as i32) < 0 {
checkpoint.oldestXid = xlog_checkpoint.oldestXid;
*checkpoint_modified = true;
}
}
}
// Now that this record has been handled, let the repository know that
// it is up-to-date to this LSN
timeline.advance_last_record_lsn(lsn.align());
Ok(())
}
/// Subroutine of save_decoded_record(), to handle an XLOG_DBASE_CREATE record.
fn save_xlog_dbase_create(timeline: &dyn Timeline, lsn: Lsn, rec: &XlCreateDatabase) -> Result<()> {
fn save_xlog_dbase_create(
timeline: &dyn TimelineWriter,
lsn: Lsn,
rec: &XlCreateDatabase,
) -> Result<()> {
let db_id = rec.db_id;
let tablespace_id = rec.tablespace_id;
let src_db_id = rec.src_db_id;
@@ -558,37 +665,36 @@ fn save_xlog_dbase_create(timeline: &dyn Timeline, lsn: Lsn, rec: &XlCreateDatab
let mut num_rels_copied = 0;
let mut num_blocks_copied = 0;
for src_rel in rels {
assert_eq!(src_rel.spcnode, src_tablespace_id);
assert_eq!(src_rel.dbnode, src_db_id);
for rel in rels {
if let RelishTag::Relation(src_rel) = rel {
assert_eq!(src_rel.spcnode, src_tablespace_id);
assert_eq!(src_rel.dbnode, src_db_id);
let nblocks = timeline
.get_relish_size(RelishTag::Relation(src_rel), req_lsn)?
.unwrap_or(0);
let dst_rel = RelTag {
spcnode: tablespace_id,
dbnode: db_id,
relnode: src_rel.relnode,
forknum: src_rel.forknum,
};
let nblocks = timeline.get_relish_size(rel, req_lsn)?.unwrap_or(0);
let dst_rel = RelTag {
spcnode: tablespace_id,
dbnode: db_id,
relnode: src_rel.relnode,
forknum: src_rel.forknum,
};
// Copy content
for blknum in 0..nblocks {
let content =
timeline.get_page_at_lsn_nowait(RelishTag::Relation(src_rel), blknum, req_lsn)?;
// Copy content
for blknum in 0..nblocks {
let content = timeline.get_page_at_lsn(rel, blknum, req_lsn)?;
debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel);
debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel);
timeline.put_page_image(RelishTag::Relation(dst_rel), blknum, lsn, content)?;
num_blocks_copied += 1;
timeline.put_page_image(RelishTag::Relation(dst_rel), blknum, lsn, content)?;
num_blocks_copied += 1;
}
if nblocks == 0 {
// make sure we have some trace of the relation, even if it's empty
timeline.put_truncation(RelishTag::Relation(dst_rel), lsn, 0)?;
}
num_rels_copied += 1;
}
if nblocks == 0 {
// make sure we have some trace of the relation, even if it's empty
timeline.put_truncation(RelishTag::Relation(dst_rel), lsn, 0)?;
}
num_rels_copied += 1;
}
// Copy relfilemap
@@ -597,7 +703,7 @@ fn save_xlog_dbase_create(timeline: &dyn Timeline, lsn: Lsn, rec: &XlCreateDatab
for tag in timeline.list_nonrels(req_lsn)? {
if let RelishTag::FileNodeMap { spcnode, dbnode } = tag {
if spcnode == src_tablespace_id && dbnode == src_db_id {
let img = timeline.get_page_at_lsn_nowait(tag, 0, req_lsn)?;
let img = timeline.get_page_at_lsn(tag, 0, req_lsn)?;
let new_tag = RelishTag::FileNodeMap {
spcnode: tablespace_id,
dbnode: db_id,
@@ -617,7 +723,11 @@ fn save_xlog_dbase_create(timeline: &dyn Timeline, lsn: Lsn, rec: &XlCreateDatab
/// Subroutine of save_decoded_record(), to handle an XLOG_SMGR_TRUNCATE record.
///
/// This is the same logic as in PostgreSQL's smgr_redo() function.
fn save_xlog_smgr_truncate(timeline: &dyn Timeline, lsn: Lsn, rec: &XlSmgrTruncate) -> Result<()> {
fn save_xlog_smgr_truncate(
timeline: &dyn TimelineWriter,
lsn: Lsn,
rec: &XlSmgrTruncate,
) -> Result<()> {
let spcnode = rec.rnode.spcnode;
let dbnode = rec.rnode.dbnode;
let relnode = rec.rnode.relnode;
@@ -679,7 +789,7 @@ fn save_xlog_smgr_truncate(timeline: &dyn Timeline, lsn: Lsn, rec: &XlSmgrTrunca
/// Subroutine of save_decoded_record(), to handle an XLOG_XACT_* records.
///
fn save_xact_record(
timeline: &dyn Timeline,
timeline: &dyn TimelineWriter,
lsn: Lsn,
parsed: &XlXactParsedRecord,
decoded: &DecodedWALRecord,
@@ -690,12 +800,12 @@ fn save_xact_record(
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
let rec = WALRecord {
lsn,
will_init: false,
rec: decoded.record.clone(),
main_data_offset: decoded.main_data_offset as u32,
};
timeline.put_wal_record(
lsn,
RelishTag::Slru {
slru: SlruKind::Clog,
segno,
@@ -711,6 +821,7 @@ fn save_xact_record(
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
timeline.put_wal_record(
lsn,
RelishTag::Slru {
slru: SlruKind::Clog,
segno,
@@ -736,7 +847,8 @@ fn save_xact_record(
fn save_clog_truncate_record(
checkpoint: &mut CheckPoint,
timeline: &dyn Timeline,
checkpoint_modified: &mut bool,
timeline: &dyn TimelineWriter,
lsn: Lsn,
xlrec: &XlClogTruncate,
) -> Result<()> {
@@ -754,6 +866,7 @@ fn save_clog_truncate_record(
// TODO Figure out if there will be any issues with replica.
checkpoint.oldestXid = xlrec.oldest_xid;
checkpoint.oldestXidDB = xlrec.oldest_xid_db;
*checkpoint_modified = true;
// TODO Treat AdvanceOldestClogXid() or write a comment why we don't need it
@@ -796,13 +909,13 @@ fn save_clog_truncate_record(
fn save_multixact_create_record(
checkpoint: &mut CheckPoint,
timeline: &dyn Timeline,
checkpoint_modified: &mut bool,
timeline: &dyn TimelineWriter,
lsn: Lsn,
xlrec: &XlMultiXactCreate,
decoded: &DecodedWALRecord,
) -> Result<()> {
let rec = WALRecord {
lsn,
will_init: false,
rec: decoded.record.clone(),
main_data_offset: decoded.main_data_offset as u32,
@@ -811,6 +924,7 @@ fn save_multixact_create_record(
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
timeline.put_wal_record(
lsn,
RelishTag::Slru {
slru: SlruKind::MultiXactOffsets,
segno,
@@ -830,6 +944,7 @@ fn save_multixact_create_record(
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
timeline.put_wal_record(
lsn,
RelishTag::Slru {
slru: SlruKind::MultiXactMembers,
segno,
@@ -852,9 +967,11 @@ fn save_multixact_create_record(
}
if xlrec.mid >= checkpoint.nextMulti {
checkpoint.nextMulti = xlrec.mid + 1;
*checkpoint_modified = true;
}
if xlrec.moff + xlrec.nmembers > checkpoint.nextMultiOffset {
checkpoint.nextMultiOffset = xlrec.moff + xlrec.nmembers;
*checkpoint_modified = true;
}
let max_mbr_xid = xlrec.members.iter().fold(0u32, |acc, mbr| {
if mbr.xid.wrapping_sub(acc) as i32 > 0 {
@@ -864,18 +981,22 @@ fn save_multixact_create_record(
}
});
checkpoint.update_next_xid(max_mbr_xid);
if checkpoint.update_next_xid(max_mbr_xid) {
*checkpoint_modified = true;
}
Ok(())
}
fn save_multixact_truncate_record(
checkpoint: &mut CheckPoint,
timeline: &dyn Timeline,
checkpoint_modified: &mut bool,
timeline: &dyn TimelineWriter,
lsn: Lsn,
xlrec: &XlMultiXactTruncate,
) -> Result<()> {
checkpoint.oldestMulti = xlrec.end_trunc_off;
checkpoint.oldestMultiDB = xlrec.oldest_multi_db;
*checkpoint_modified = true;
// PerformMembersTruncation
let maxsegment: i32 = mx_offset_to_member_segment(pg_constants::MAX_MULTIXACT_OFFSET);
@@ -909,7 +1030,7 @@ fn save_multixact_truncate_record(
}
fn save_relmap_page(
timeline: &dyn Timeline,
timeline: &dyn TimelineWriter,
lsn: Lsn,
xlrec: &XlRelmapUpdate,
decoded: &DecodedWALRecord,


@@ -3,72 +3,305 @@
use crate::branches;
use crate::layered_repository::LayeredRepository;
use crate::repository::Repository;
use crate::repository::{Repository, Timeline};
use crate::tenant_threads;
use crate::walredo::PostgresRedoManager;
use crate::PageServerConf;
use anyhow::{anyhow, bail, Result};
use anyhow::{anyhow, bail, Context, Result};
use lazy_static::lazy_static;
use log::info;
use log::*;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::fmt;
use std::fs;
use std::str::FromStr;
use std::sync::{Arc, Mutex};
use zenith_utils::zid::ZTenantId;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, MutexGuard};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
lazy_static! {
pub static ref REPOSITORY: Mutex<HashMap<ZTenantId, Arc<dyn Repository>>> =
Mutex::new(HashMap::new());
static ref TENANTS: Mutex<HashMap<ZTenantId, Tenant>> = Mutex::new(HashMap::new());
}
pub fn init(conf: &'static PageServerConf) {
let mut m = REPOSITORY.lock().unwrap();
struct Tenant {
state: TenantState,
repo: Option<Arc<dyn Repository>>,
}
#[derive(Debug, Serialize, Deserialize, Clone, Copy, PartialEq, Eq)]
pub enum TenantState {
// This tenant only exists in cloud storage. It cannot be accessed.
CloudOnly,
// This tenant exists in cloud storage, and we are currently downloading it to local disk.
// It cannot be accessed yet, not until it's been fully downloaded to local disk.
Downloading,
// All data for this tenant is complete on local disk, but we haven't loaded the Repository,
// Timeline and Layer structs into memory yet, so it cannot be accessed yet.
//Ready,
// This tenant exists on local disk, and the layer map has been loaded into memory.
// The local disk might have some newer files that don't exist in cloud storage yet.
Active,
// Tenant is active, but there is no walreceiver connection.
Idle,
// This tenant exists on local disk, and the layer map has been loaded into memory.
// The local disk might have some newer files that don't exist in cloud storage yet.
// The tenant cannot be accessed anymore for any reason except graceful shutdown.
Stopping,
}
/// A remote storage timeline synchronization event, that needs another step
/// to be fully completed.
#[derive(Debug)]
pub enum PostTimelineSyncStep {
/// The timeline cannot be synchronized anymore due to some sync issues.
/// Needs to be removed from pageserver, to avoid further data diverging.
Evict,
/// A new timeline got downloaded and needs to be loaded into pageserver.
RegisterDownload,
}
impl fmt::Display for TenantState {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
TenantState::CloudOnly => f.write_str("CloudOnly"),
TenantState::Downloading => f.write_str("Downloading"),
TenantState::Active => f.write_str("Active"),
TenantState::Idle => f.write_str("Idle"),
TenantState::Stopping => f.write_str("Stopping"),
}
}
}
fn access_tenants() -> MutexGuard<'static, HashMap<ZTenantId, Tenant>> {
TENANTS.lock().unwrap()
}
static SHUTDOWN_REQUESTED: AtomicBool = AtomicBool::new(false);
pub fn init(conf: &'static PageServerConf) {
for dir_entry in fs::read_dir(conf.tenants_path()).unwrap() {
let tenantid =
ZTenantId::from_str(dir_entry.unwrap().file_name().to_str().unwrap()).unwrap();
// Set up a WAL redo manager, for applying WAL records.
let walredo_mgr = PostgresRedoManager::new(conf, tenantid);
// Set up an object repository, for actual data storage.
let repo = Arc::new(LayeredRepository::new(
conf,
Arc::new(walredo_mgr),
tenantid,
));
LayeredRepository::launch_checkpointer_thread(conf, repo.clone());
{
let mut m = access_tenants();
let tenant = Tenant {
state: TenantState::CloudOnly,
repo: None,
};
m.insert(tenantid, tenant);
}
init_repo(conf, tenantid);
info!("initialized storage for tenant: {}", &tenantid);
m.insert(tenantid, repo);
}
}
fn init_repo(conf: &'static PageServerConf, tenant_id: ZTenantId) {
// Set up a WAL redo manager, for applying WAL records.
let walredo_mgr = PostgresRedoManager::new(conf, tenant_id);
// Set up an object repository, for actual data storage.
let repo = Arc::new(LayeredRepository::new(
conf,
Arc::new(walredo_mgr),
tenant_id,
false,
));
let mut m = access_tenants();
let tenant = m.get_mut(&tenant_id).unwrap();
tenant.repo = Some(repo);
tenant.state = TenantState::Idle;
}
pub fn perform_post_timeline_sync_steps(
conf: &'static PageServerConf,
post_sync_steps: HashMap<(ZTenantId, ZTimelineId), PostTimelineSyncStep>,
) {
if post_sync_steps.is_empty() {
return;
}
info!("Performing {} post-sync steps", post_sync_steps.len());
trace!("Steps: {:?}", post_sync_steps);
{
let mut m = access_tenants();
for &(tenant_id, timeline_id) in post_sync_steps.keys() {
let tenant = m.entry(tenant_id).or_insert_with(|| Tenant {
state: TenantState::Downloading,
repo: None,
});
tenant.state = TenantState::Downloading;
match &tenant.repo {
Some(repo) => {
init_timeline(repo.as_ref(), timeline_id);
tenant.state = TenantState::Idle;
return;
}
None => log::warn!("Initialize new repo"),
}
tenant.state = TenantState::Idle;
}
}
for ((tenant_id, timeline_id), post_sync_step) in post_sync_steps {
match post_sync_step {
PostTimelineSyncStep::Evict => {
if let Err(e) = get_repository_for_tenant(tenant_id)
.and_then(|repo| repo.unload_timeline(timeline_id))
{
error!(
"Failed to remove repository for tenant {}, timeline {}: {:#}",
tenant_id, timeline_id, e
)
}
}
PostTimelineSyncStep::RegisterDownload => {
// init repo updates Tenant state
init_repo(conf, tenant_id);
let new_repo = get_repository_for_tenant(tenant_id).unwrap();
init_timeline(new_repo.as_ref(), timeline_id);
}
}
}
}
fn init_timeline(repo: &dyn Repository, timeline_id: ZTimelineId) {
match repo.get_timeline(timeline_id) {
Ok(_timeline) => log::info!("Successfully initialized timeline {}", timeline_id),
Err(e) => log::error!("Failed to init timeline {}, reason: {:#}", timeline_id, e),
}
}
// Check this flag in the thread loops to know when to exit
pub fn shutdown_requested() -> bool {
SHUTDOWN_REQUESTED.load(Ordering::Relaxed)
}
pub fn shutdown_all_tenants() -> Result<()> {
SHUTDOWN_REQUESTED.swap(true, Ordering::Relaxed);
let tenantids = list_tenantids()?;
for tenantid in &tenantids {
set_tenant_state(*tenantid, TenantState::Stopping)?;
}
for tenantid in tenantids {
// Wait for checkpointer and GC to finish their job
tenant_threads::wait_for_tenant_threads_to_stop(tenantid);
let repo = get_repository_for_tenant(tenantid)?;
debug!("shutdown tenant {}", tenantid);
repo.shutdown()?;
}
Ok(())
}
pub fn create_repository_for_tenant(
conf: &'static PageServerConf,
tenantid: ZTenantId,
) -> Result<()> {
let mut m = REPOSITORY.lock().unwrap();
// First check that the tenant doesn't exist already
if m.get(&tenantid).is_some() {
bail!("tenant {} already exists", tenantid);
{
let mut m = access_tenants();
// First check that the tenant doesn't exist already
if m.get(&tenantid).is_some() {
bail!("tenant {} already exists", tenantid);
}
let tenant = Tenant {
state: TenantState::CloudOnly,
repo: None,
};
m.insert(tenantid, tenant);
}
let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenantid));
let repo = branches::create_repo(conf, tenantid, wal_redo_manager)?;
m.insert(tenantid, repo);
let mut m = access_tenants();
let tenant = m.get_mut(&tenantid).unwrap();
tenant.repo = Some(repo);
tenant.state = TenantState::Idle;
Ok(())
}
// If the tenant is not found in the map, report it as CloudOnly
pub fn get_tenant_state(tenantid: ZTenantId) -> TenantState {
    let m = access_tenants();
    match m.get(&tenantid) {
        Some(tenant) => tenant.state,
        None => TenantState::CloudOnly,
    }
}
pub fn set_tenant_state(tenantid: ZTenantId, newstate: TenantState) -> Result<TenantState> {
let mut m = access_tenants();
let tenant = m.get_mut(&tenantid);
match tenant {
Some(tenant) => {
if newstate == TenantState::Idle && tenant.state != TenantState::Active {
// Only Active tenant can become Idle
return Ok(tenant.state);
}
info!("set_tenant_state: {} -> {}", tenant.state, newstate);
tenant.state = newstate;
Ok(tenant.state)
}
None => bail!("tenant {} not found", tenantid),
}
}
pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result<Arc<dyn Repository>> {
let m = access_tenants();
let tenant = m
.get(&tenantid)
.ok_or_else(|| anyhow!("tenant {} not found", tenantid))?;
match &tenant.repo {
Some(repo) => Ok(Arc::clone(repo)),
None => anyhow::bail!("Repository for tenant {} is not yet valid", tenantid),
}
}
pub fn get_timeline_for_tenant(
tenantid: ZTenantId,
timelineid: ZTimelineId,
) -> Result<Arc<dyn Timeline>> {
get_repository_for_tenant(tenantid)?
.get_timeline(timelineid)
.with_context(|| format!("cannot fetch timeline {}", timelineid))
}
fn list_tenantids() -> Result<Vec<ZTenantId>> {
let m = access_tenants();
m.iter()
.map(|v| {
let (tenantid, _) = v;
Ok(*tenantid)
})
.collect()
}
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantInfo {
#[serde(with = "hex")]
pub id: ZTenantId,
pub state: TenantState,
}
pub fn list_tenants() -> Result<Vec<TenantInfo>> {
let m = access_tenants();
m.iter()
.map(|v| {
let (id, tenant) = v;
Ok(TenantInfo {
id: *id,
state: tenant.state,
})
})
.collect()
}
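Taken together, the functions above sketch a small per-tenant state machine. The snippet below is an illustrative summary only, not part of the change: the helper name and comments are ours, and it merely strings together calls that the rest of this changeset makes (launching a walreceiver sets Active, dropping the last walreceiver sets Idle, shutdown sets Stopping).

    // Sketch only: the intended flow of TenantState for one tenant.
    fn tenant_lifecycle_sketch(conf: &'static PageServerConf, tenantid: ZTenantId) -> Result<()> {
        create_repository_for_tenant(conf, tenantid)?;      // registered as CloudOnly, then Idle
        set_tenant_state(tenantid, TenantState::Active)?;   // done by launch_wal_receiver
        set_tenant_state(tenantid, TenantState::Idle)?;     // allowed, since the tenant was Active
        set_tenant_state(tenantid, TenantState::Stopping)?; // done by shutdown_all_tenants
        Ok(())
    }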


@@ -0,0 +1,149 @@
//! This module contains functions to serve per-tenant background processes,
//! such as checkpointer and GC
use crate::tenant_mgr;
use crate::tenant_mgr::TenantState;
use crate::CheckpointConfig;
use crate::PageServerConf;
use anyhow::Result;
use lazy_static::lazy_static;
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::JoinHandle;
use std::time::Duration;
use tracing::*;
use zenith_metrics::{register_int_gauge_vec, IntGaugeVec};
use zenith_utils::zid::ZTenantId;
struct TenantHandleEntry {
checkpointer_handle: Option<JoinHandle<()>>,
gc_handle: Option<JoinHandle<()>>,
}
// Preserve handles to wait for thread completion
// at shutdown
lazy_static! {
static ref TENANT_HANDLES: Mutex<HashMap<ZTenantId, TenantHandleEntry>> =
Mutex::new(HashMap::new());
}
lazy_static! {
static ref TENANT_THREADS_COUNT: IntGaugeVec = register_int_gauge_vec!(
"tenant_threads_count",
"Number of live tenant threads",
&["tenant_thread_type"]
)
.expect("failed to define a metric");
}
// Launch checkpointer and GC for the tenant.
// It's possible that the threads are running already,
// if so, just don't spawn new ones.
pub fn start_tenant_threads(conf: &'static PageServerConf, tenantid: ZTenantId) {
let mut handles = TENANT_HANDLES.lock().unwrap();
let h = handles
.entry(tenantid)
.or_insert_with(|| TenantHandleEntry {
checkpointer_handle: None,
gc_handle: None,
});
if h.checkpointer_handle.is_none() {
h.checkpointer_handle = std::thread::Builder::new()
.name("Checkpointer thread".into())
.spawn(move || {
checkpoint_loop(tenantid, conf).expect("Checkpointer thread died");
})
.ok();
}
if h.gc_handle.is_none() {
h.gc_handle = std::thread::Builder::new()
.name("GC thread".into())
.spawn(move || {
gc_loop(tenantid, conf).expect("GC thread died");
})
.ok();
}
}
pub fn wait_for_tenant_threads_to_stop(tenantid: ZTenantId) {
let mut handles = TENANT_HANDLES.lock().unwrap();
if let Some(h) = handles.get_mut(&tenantid) {
h.checkpointer_handle.take().map(JoinHandle::join);
trace!("checkpointer for tenant {} has stopped", tenantid);
h.gc_handle.take().map(JoinHandle::join);
trace!("gc for tenant {} has stopped", tenantid);
}
handles.remove(&tenantid);
}
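As a minimal illustration of how these two entry points pair up (the wrapper below is ours and purely hypothetical), the loops are started when a tenant becomes Active and joined again when it is being stopped:

    // Sketch only: start the per-tenant loops, then join them at shutdown.
    fn tenant_threads_usage_sketch(conf: &'static PageServerConf, tenantid: ZTenantId) {
        start_tenant_threads(conf, tenantid); // no-op for handles that already exist
        // ... the tenant serves traffic while Active ...
        wait_for_tenant_threads_to_stop(tenantid); // joins checkpointer and GC, drops the handles
    }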
///
/// Checkpointer thread's main loop
///
fn checkpoint_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> {
let gauge = TENANT_THREADS_COUNT.with_label_values(&["checkpointer"]);
gauge.inc();
scopeguard::defer! {
gauge.dec();
}
loop {
if tenant_mgr::get_tenant_state(tenantid) != TenantState::Active {
break;
}
std::thread::sleep(conf.checkpoint_period);
trace!("checkpointer thread for tenant {} waking up", tenantid);
// checkpoint timelines that have accumulated more than CHECKPOINT_DISTANCE
// bytes of WAL since last checkpoint.
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
repo.checkpoint_iteration(CheckpointConfig::Distance(conf.checkpoint_distance))?;
}
trace!(
"checkpointer thread stopped for tenant {} state is {}",
tenantid,
tenant_mgr::get_tenant_state(tenantid)
);
Ok(())
}
///
/// GC thread's main loop
///
fn gc_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> {
let gauge = TENANT_THREADS_COUNT.with_label_values(&["gc"]);
gauge.inc();
scopeguard::defer! {
gauge.dec();
}
loop {
if tenant_mgr::get_tenant_state(tenantid) != TenantState::Active {
break;
}
trace!("gc thread for tenant {} waking up", tenantid);
// Garbage collect old files that are not needed for PITR anymore
if conf.gc_horizon > 0 {
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
repo.gc_iteration(None, conf.gc_horizon, false).unwrap();
}
// TODO Write it in more adequate way using
// condvar.wait_timeout() or something
let mut sleep_time = conf.gc_period.as_secs();
while sleep_time > 0 && tenant_mgr::get_tenant_state(tenantid) == TenantState::Active {
sleep_time -= 1;
std::thread::sleep(Duration::from_secs(1));
}
}
trace!(
"GC thread stopped for tenant {} state is {}",
tenantid,
tenant_mgr::get_tenant_state(tenantid)
);
Ok(())
}
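The TODO above asks for a more adequate wait than the one-second polling loop. One possible shape, sketched here under the assumption that a small per-tenant shutdown signal would be acceptable (the GcWakeup type and its fields are hypothetical, not part of this change), is Condvar::wait_timeout:

    use std::sync::{Condvar, Mutex};
    use std::time::Duration;

    // Hypothetical helper: sleep for `period`, but wake up early when notified.
    struct GcWakeup {
        stop: Mutex<bool>,
        cond: Condvar,
    }

    impl GcWakeup {
        // Returns true if a stop was requested while we were waiting.
        fn sleep_or_stop(&self, period: Duration) -> bool {
            let guard = self.stop.lock().unwrap();
            let (guard, _timed_out) = self.cond.wait_timeout(guard, period).unwrap();
            *guard
        }

        fn request_stop(&self) {
            *self.stop.lock().unwrap() = true;
            self.cond.notify_all();
        }
    }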


@@ -0,0 +1,619 @@
//!
//! VirtualFile is like a normal File, but it's not bound directly to
//! a file descriptor. Instead, the file is opened when it's read from,
//! and if too many files are open globally in the system, least-recently
//! used ones are closed.
//!
//! To track which files have been recently used, we use the clock algorithm
//! with a 'recently_used' flag on each slot.
//!
//! This is similar to PostgreSQL's virtual file descriptor facility in
//! src/backend/storage/file/fd.c
//!
use std::fs::{File, OpenOptions};
use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};
use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{RwLock, RwLockWriteGuard};
use once_cell::sync::OnceCell;
///
/// A virtual file descriptor. You can use this just like std::fs::File, but internally
/// the underlying file is closed if the system is low on file descriptors,
/// and re-opened when it's accessed again.
///
/// Like with std::fs::File, multiple threads can read/write the file concurrently,
/// holding just a shared reference the same VirtualFile, using the read_at() / write_at()
/// functions from the FileExt trait. But the functions from the Read/Write/Seek traits
/// require a mutable reference, because they modify the "current position".
///
/// Each VirtualFile has a physical file descriptor in the global OPEN_FILES array, at the
/// slot that 'handle' points to, if the underlying file is currently open. If it's not
/// currently open, the 'handle' can still point to the slot where it was last kept. The
/// 'tag' field is used to detect whether the handle still is valid or not.
///
pub struct VirtualFile {
/// Lazy handle to the global file descriptor cache. The slot that this points to
/// might contain our File, or it may be empty, or it may contain a File that
/// belongs to a different VirtualFile.
handle: RwLock<SlotHandle>,
/// Current file position
pos: u64,
/// File path and options to use to open it.
///
/// Note: this only contains the options needed to re-open it. For example,
/// if a new file is created, we only pass the create flag when it's initially
/// opened, in the VirtualFile::create() function, and strip the flag before
/// storing it here.
pub path: PathBuf,
open_options: OpenOptions,
}
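To make the doc comment above concrete, here is a minimal usage sketch. The path is made up, and it assumes virtual_file::init() has already run at page server startup, as described further down in this file:

    // Sketch only: VirtualFile mirrors std::fs::File for the common operations.
    fn virtual_file_usage_sketch() -> Result<(), std::io::Error> {
        let path = std::path::Path::new("/tmp/vfile_example"); // hypothetical path
        let mut f = VirtualFile::create(path)?;                // may evict another slot's File
        f.write_all(b"hello")?;                                // Write goes through the cached slot
        f.sync_all()?;
        let f = VirtualFile::open(path)?;
        let mut buf = [0u8; 5];
        f.read_exact_at(&mut buf, 0)?;                         // FileExt works on a shared reference
        Ok(())
    }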
#[derive(PartialEq, Clone, Copy)]
struct SlotHandle {
/// Index into OPEN_FILES.slots
index: usize,
/// Value of 'tag' in the slot. If slot's tag doesn't match, then the slot has
/// been recycled and no longer contains the FD for this virtual file.
tag: u64,
}
/// OPEN_FILES is the global array that holds the physical file descriptors that
/// are currently open. Each slot in the array is protected by a separate lock,
/// so that different files can be accessed independently. The lock must be held
/// in write mode to replace the slot with a different file, but a read mode
/// is enough to operate on the file, whether you're reading or writing to it.
///
/// OPEN_FILES starts in uninitialized state, and it's initialized by
/// the virtual_file::init() function. It must be called exactly once at page
/// server startup.
static OPEN_FILES: OnceCell<OpenFiles> = OnceCell::new();
struct OpenFiles {
slots: &'static [Slot],
/// clock arm for the clock algorithm
next: AtomicUsize,
}
struct Slot {
inner: RwLock<SlotInner>,
/// has this file been used since last clock sweep?
recently_used: AtomicBool,
}
struct SlotInner {
/// Counter that's incremented every time a different file is stored here.
/// To avoid the ABA problem.
tag: u64,
/// the underlying file
file: Option<File>,
}
impl OpenFiles {
/// Find a slot to use, evicting an existing file descriptor if needed.
///
/// On return, we hold a lock on the slot, its 'tag' has been updated, and
/// recently_used has been set. It's all ready for reuse.
fn find_victim_slot(&self) -> (SlotHandle, RwLockWriteGuard<SlotInner>) {
//
// Run the clock algorithm to find a slot to replace.
//
let num_slots = self.slots.len();
let mut retries = 0;
let mut slot;
let mut slot_guard;
let index;
loop {
let next = self.next.fetch_add(1, Ordering::AcqRel) % num_slots;
slot = &self.slots[next];
// If the recently_used flag on this slot is set, continue the clock
// sweep. Otherwise try to use this slot. If we cannot acquire the
// lock, also continue the clock sweep.
//
// We only continue in this manner for a while, though. If we loop
// through the array twice without finding a victim, just pick the
// next slot and wait until we can reuse it. This way, we avoid
// spinning in the extreme case that all the slots are busy with an
// I/O operation.
if retries < num_slots * 2 {
if !slot.recently_used.swap(false, Ordering::Release) {
if let Ok(guard) = slot.inner.try_write() {
slot_guard = guard;
index = next;
break;
}
}
retries += 1;
} else {
slot_guard = slot.inner.write().unwrap();
index = next;
break;
}
}
//
// We now have the victim slot locked. If it was in use previously, close the
// old file.
//
if let Some(old_file) = slot_guard.file.take() {
drop(old_file);
}
// Prepare the slot for reuse and return it
slot_guard.tag += 1;
slot.recently_used.store(true, Ordering::Relaxed);
(
SlotHandle {
index,
tag: slot_guard.tag,
},
slot_guard,
)
}
}
impl VirtualFile {
/// Open a file in read-only mode. Like File::open.
pub fn open(path: &Path) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(path, OpenOptions::new().read(true))
}
/// Create a new file for writing. If the file exists, it will be truncated.
/// Like File::create.
pub fn create(path: &Path) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(
path,
OpenOptions::new().write(true).create(true).truncate(true),
)
}
/// Open a file with given options.
///
/// Note: If any custom flags were set in 'open_options' through OpenOptionsExt,
/// they will be applied also when the file is subsequently re-opened, not only
/// on the first time. Make sure that's sane!
pub fn open_with_options(
path: &Path,
open_options: &OpenOptions,
) -> Result<VirtualFile, std::io::Error> {
let (handle, mut slot_guard) = get_open_files().find_victim_slot();
let file = open_options.open(path)?;
// Strip all options other than read and write.
//
// It would perhaps be nicer to check just for the read and write flags
// explicitly, but OpenOptions doesn't contain any functions to read flags,
// only to set them.
let mut reopen_options = open_options.clone();
reopen_options.create(false);
reopen_options.create_new(false);
reopen_options.truncate(false);
let vfile = VirtualFile {
handle: RwLock::new(handle),
pos: 0,
path: path.to_path_buf(),
open_options: reopen_options,
};
slot_guard.file.replace(file);
Ok(vfile)
}
/// Call File::sync_all() on the underlying File.
pub fn sync_all(&self) -> Result<(), Error> {
self.with_file(|file| file.sync_all())?
}
/// Helper function that looks up the underlying File for this VirtualFile,
/// opening it and evicting some other File if necessary. It calls 'func'
/// with the physical File.
fn with_file<F, R>(&self, mut func: F) -> Result<R, Error>
where
F: FnMut(&File) -> R,
{
let open_files = get_open_files();
let mut handle_guard = {
// Read the cached slot handle, and see if the slot that it points to still
// contains our File.
//
// We only need to hold the handle lock while we read the current handle. If
// another thread closes the file and recycles the slot for a different file,
// we will notice that the handle we read is no longer valid and retry.
let mut handle = *self.handle.read().unwrap();
loop {
// Check if the slot contains our File
{
let slot = &open_files.slots[handle.index];
let slot_guard = slot.inner.read().unwrap();
if slot_guard.tag == handle.tag {
if let Some(file) = &slot_guard.file {
// Found a cached file descriptor.
slot.recently_used.store(true, Ordering::Relaxed);
return Ok(func(file));
}
}
}
// The slot didn't contain our File. We will have to open it ourselves,
// but before that, grab a write lock on handle in the VirtualFile, so
// that no other thread will try to concurrently open the same file.
let handle_guard = self.handle.write().unwrap();
// If another thread changed the handle while we were not holding the lock,
// then the handle might now be valid again. Loop back to retry.
if *handle_guard != handle {
handle = *handle_guard;
continue;
}
break handle_guard;
}
};
// We need to open the file ourselves. The handle in the VirtualFile is
// now locked in write-mode. Find a free slot to put it in.
let (handle, mut slot_guard) = open_files.find_victim_slot();
// Open the physical file
let file = self.open_options.open(&self.path)?;
// Perform the requested operation on it
//
// TODO: We could downgrade the locks to read mode before calling
// 'func', to allow a little bit more concurrency, but the standard
// library RwLock doesn't allow downgrading without releasing the lock,
// and that doesn't seem worth the trouble. (parking_lot RwLock would
// allow it)
let result = func(&file);
// Store the File in the slot and update the handle in the VirtualFile
// to point to it.
slot_guard.file.replace(file);
*handle_guard = handle;
Ok(result)
}
}
impl Drop for VirtualFile {
/// If a VirtualFile is dropped, close the underlying file if it was open.
fn drop(&mut self) {
let handle = self.handle.get_mut().unwrap();
// We could check with a read-lock first, to avoid waiting on an
// unrelated I/O.
let slot = &get_open_files().slots[handle.index];
let mut slot_guard = slot.inner.write().unwrap();
if slot_guard.tag == handle.tag {
slot.recently_used.store(false, Ordering::Relaxed);
slot_guard.file.take();
}
}
}
impl Read for VirtualFile {
fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error> {
let pos = self.pos;
let n = self.read_at(buf, pos)?;
self.pos += n as u64;
Ok(n)
}
}
impl Write for VirtualFile {
fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
let pos = self.pos;
let n = self.write_at(buf, pos)?;
self.pos += n as u64;
Ok(n)
}
fn flush(&mut self) -> Result<(), std::io::Error> {
// flush is no-op for File (at least on unix), so we don't need to do
// anything here either.
Ok(())
}
}
impl Seek for VirtualFile {
fn seek(&mut self, pos: SeekFrom) -> Result<u64, Error> {
match pos {
SeekFrom::Start(offset) => {
self.pos = offset;
}
SeekFrom::End(offset) => {
self.pos = self.with_file(|mut file| file.seek(SeekFrom::End(offset)))??
}
SeekFrom::Current(offset) => {
let pos = self.pos as i128 + offset as i128;
if pos < 0 {
return Err(Error::new(
ErrorKind::InvalidInput,
"offset would be negative",
));
}
if pos > u64::MAX as i128 {
return Err(Error::new(ErrorKind::InvalidInput, "offset overflow"));
}
self.pos = pos as u64;
}
}
Ok(self.pos)
}
}
impl FileExt for VirtualFile {
fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<usize, Error> {
self.with_file(|file| file.read_at(buf, offset))?
}
fn write_at(&self, buf: &[u8], offset: u64) -> Result<usize, Error> {
self.with_file(|file| file.write_at(buf, offset))?
}
}
impl OpenFiles {
fn new(num_slots: usize) -> OpenFiles {
let mut slots = Box::new(Vec::with_capacity(num_slots));
for _ in 0..num_slots {
let slot = Slot {
recently_used: AtomicBool::new(false),
inner: RwLock::new(SlotInner { tag: 0, file: None }),
};
slots.push(slot);
}
OpenFiles {
next: AtomicUsize::new(0),
slots: Box::leak(slots),
}
}
}
///
/// Initialize the virtual file module. This must be called once at page
/// server startup.
///
pub fn init(num_slots: usize) {
if OPEN_FILES.set(OpenFiles::new(num_slots)).is_err() {
panic!("virtual_file::init called twice");
}
}
const TEST_MAX_FILE_DESCRIPTORS: usize = 10;
// Get a handle to the global slots array.
fn get_open_files() -> &'static OpenFiles {
//
// In unit tests, page server startup doesn't happen and no one calls
// virtual_file::init(). Initialize it here, with a small array.
//
// This applies to the virtual file tests below, but all other unit
// tests too, so the virtual file facility is always usable in
// unit tests.
//
if cfg!(test) {
OPEN_FILES.get_or_init(|| OpenFiles::new(TEST_MAX_FILE_DESCRIPTORS))
} else {
OPEN_FILES.get().expect("virtual_file::init not called yet")
}
}
#[cfg(test)]
mod tests {
use super::*;
use rand::seq::SliceRandom;
use rand::thread_rng;
use rand::Rng;
use std::sync::Arc;
use std::thread;
// Helper function to slurp contents of a file, starting at the current position,
// into a string
fn read_string<FD>(vfile: &mut FD) -> Result<String, Error>
where
FD: Read,
{
let mut buf = String::new();
vfile.read_to_string(&mut buf)?;
Ok(buf)
}
// Helper function to slurp a portion of a file into a string
fn read_string_at<FD>(vfile: &mut FD, pos: u64, len: usize) -> Result<String, Error>
where
FD: FileExt,
{
let mut buf = Vec::new();
buf.resize(len, 0);
vfile.read_exact_at(&mut buf, pos)?;
Ok(String::from_utf8(buf).unwrap())
}
#[test]
fn test_virtual_files() -> Result<(), Error> {
// The real work is done in the test_files() helper function. This
// allows us to run the same set of tests against a native File, and
// VirtualFile. We trust the native Files and wouldn't need to test them,
// but this allows us to verify that the operations return the same
// results with VirtualFiles as with native Files. (Except that with
// native files, you will run out of file descriptors if the ulimit
// is low enough.)
test_files("virtual_files", |path, open_options| {
VirtualFile::open_with_options(path, open_options)
})
}
#[test]
fn test_physical_files() -> Result<(), Error> {
test_files("physical_files", |path, open_options| {
open_options.open(path)
})
}
fn test_files<OF, FD>(testname: &str, openfunc: OF) -> Result<(), Error>
where
FD: Read + Write + Seek + FileExt,
OF: Fn(&Path, &OpenOptions) -> Result<FD, std::io::Error>,
{
let testdir = crate::PageServerConf::test_repo_dir(testname);
std::fs::create_dir_all(&testdir)?;
let path_a = testdir.join("file_a");
let mut file_a = openfunc(
&path_a,
OpenOptions::new().write(true).create(true).truncate(true),
)?;
file_a.write_all(b"foobar")?;
// cannot read from a file opened in write-only mode
assert!(read_string(&mut file_a).is_err());
// Close the file and re-open for reading
let mut file_a = openfunc(&path_a, OpenOptions::new().read(true))?;
// cannot write to a file opened in read-only mode
assert!(file_a.write(b"bar").is_err());
// Try simple read
assert_eq!("foobar", read_string(&mut file_a)?);
// It's positioned at the EOF now.
assert_eq!("", read_string(&mut file_a)?);
// Test seeks.
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
assert_eq!("oobar", read_string(&mut file_a)?);
assert_eq!(file_a.seek(SeekFrom::End(-2))?, 4);
assert_eq!("ar", read_string(&mut file_a)?);
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
assert_eq!(file_a.seek(SeekFrom::Current(2))?, 3);
assert_eq!("bar", read_string(&mut file_a)?);
assert_eq!(file_a.seek(SeekFrom::Current(-5))?, 1);
assert_eq!("oobar", read_string(&mut file_a)?);
// Test erroneous seeks to before byte 0
assert!(file_a.seek(SeekFrom::End(-7)).is_err());
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
assert!(file_a.seek(SeekFrom::Current(-2)).is_err());
// the erroneous seek should have left the position unchanged
assert_eq!("oobar", read_string(&mut file_a)?);
// Create another test file, and try FileExt functions on it.
let path_b = testdir.join("file_b");
let mut file_b = openfunc(
&path_b,
OpenOptions::new()
.read(true)
.write(true)
.create(true)
.truncate(true),
)?;
file_b.write_all_at(b"BAR", 3)?;
file_b.write_all_at(b"FOO", 0)?;
assert_eq!(read_string_at(&mut file_b, 2, 3)?, "OBA");
// Open a lot of files, enough to cause some evictions. (Or to be precise,
// open the same file many times. The effect is the same.)
//
// leave file_a positioned at offset 1 before we start
assert_eq!(file_a.seek(SeekFrom::Start(1))?, 1);
let mut vfiles = Vec::new();
for _ in 0..100 {
let mut vfile = openfunc(&path_b, OpenOptions::new().read(true))?;
assert_eq!("FOOBAR", read_string(&mut vfile)?);
vfiles.push(vfile);
}
// make sure we opened enough files to definitely cause evictions.
assert!(vfiles.len() > TEST_MAX_FILE_DESCRIPTORS * 2);
// The underlying file descriptor for 'file_a' should be closed now. Try to read
// from it again. We left the file positioned at offset 1 above.
assert_eq!("oobar", read_string(&mut file_a)?);
// Check that all the other FDs still work too. Use them in random order for
// good measure.
vfiles.as_mut_slice().shuffle(&mut thread_rng());
for vfile in vfiles.iter_mut() {
assert_eq!("OOBAR", read_string_at(vfile, 1, 5)?);
}
Ok(())
}
/// Test using VirtualFiles from many threads concurrently. This tests both using
/// a lot of VirtualFiles concurrently, causing evictions, and also using the same
/// VirtualFile from multiple threads concurrently.
#[test]
fn test_vfile_concurrency() -> Result<(), Error> {
const SIZE: usize = 8 * 1024;
const VIRTUAL_FILES: usize = 100;
const THREADS: usize = 100;
const SAMPLE: [u8; SIZE] = [0xADu8; SIZE];
let testdir = crate::PageServerConf::test_repo_dir("vfile_concurrency");
std::fs::create_dir_all(&testdir)?;
// Create a test file.
let test_file_path = testdir.join("concurrency_test_file");
{
let file = File::create(&test_file_path)?;
file.write_all_at(&SAMPLE, 0)?;
}
// Open the file many times.
let mut files = Vec::new();
for _ in 0..VIRTUAL_FILES {
let f = VirtualFile::open_with_options(&test_file_path, OpenOptions::new().read(true))?;
files.push(f);
}
let files = Arc::new(files);
// Launch many threads, and use the virtual files concurrently in random order.
let mut threads = Vec::new();
for threadno in 0..THREADS {
let builder =
thread::Builder::new().name(format!("test_vfile_concurrency thread {}", threadno));
let files = files.clone();
let thread = builder
.spawn(move || {
let mut buf = [0u8; SIZE];
let mut rng = rand::thread_rng();
for _ in 1..1000 {
let f = &files[rng.gen_range(0..files.len())];
f.read_exact_at(&mut buf, 0).unwrap();
assert!(buf == SAMPLE);
}
})
.unwrap();
threads.push(thread);
}
for thread in threads {
thread.join().unwrap();
}
Ok(())
}
}


@@ -54,6 +54,11 @@ impl WalStreamDecoder {
}
}
// The latest LSN position fed to the decoder.
pub fn available(&self) -> Lsn {
self.lsn + self.inputbuf.remaining() as u64
}
pub fn feed_bytes(&mut self, buf: &[u8]) {
self.inputbuf.extend_from_slice(buf);
}
@@ -67,6 +72,10 @@ impl WalStreamDecoder {
/// Err(WalDecodeError): an error occured while decoding, meaning the input was invalid.
///
pub fn poll_decode(&mut self) -> Result<Option<(Lsn, Bytes)>, WalDecodeError> {
let recordbuf;
// Run state machine that validates page headers, and reassembles records
// that cross page boundaries.
loop {
// parse and verify page boundaries as we go
if self.lsn.segment_offset(pg_constants::WAL_SEGMENT_SIZE) == 0 {
@@ -115,29 +124,41 @@ impl WalStreamDecoder {
self.lsn += self.padlen as u64;
self.padlen = 0;
} else if self.contlen == 0 {
// need to have at least the xl_tot_len field
assert!(self.recordbuf.is_empty());
// need to have at least the xl_tot_len field
if self.inputbuf.remaining() < 4 {
return Ok(None);
}
// read xl_tot_len FIXME: assumes little-endian
// peek xl_tot_len at the beginning of the record.
// FIXME: assumes little-endian
self.startlsn = self.lsn;
let xl_tot_len = self.inputbuf.get_u32_le();
let xl_tot_len = (&self.inputbuf[0..4]).get_u32_le();
if (xl_tot_len as usize) < XLOG_SIZE_OF_XLOG_RECORD {
return Err(WalDecodeError {
msg: format!("invalid xl_tot_len {}", xl_tot_len),
lsn: self.lsn,
});
}
self.lsn += 4;
self.recordbuf.clear();
self.recordbuf.reserve(xl_tot_len as usize);
self.recordbuf.put_u32_le(xl_tot_len);
self.contlen = xl_tot_len - 4;
continue;
// Fast path for the common case that the whole record fits on the page.
let pageleft = self.lsn.remaining_in_block() as u32;
if self.inputbuf.remaining() >= xl_tot_len as usize && xl_tot_len <= pageleft {
// Take the record from the 'inputbuf', and validate it.
recordbuf = self.inputbuf.copy_to_bytes(xl_tot_len as usize);
self.lsn += xl_tot_len as u64;
break;
} else {
// Need to assemble the record from pieces. Remember the size of the
// record, and loop back. On next iteration, we will reach the 'else'
// branch below, and copy the part of the record that was on this page
// to 'recordbuf'. Subsequent iterations will skip page headers, and
// append the continuations from the next pages to 'recordbuf'.
self.recordbuf.reserve(xl_tot_len as usize);
self.contlen = xl_tot_len;
continue;
}
} else {
// we're continuing a record, possibly from previous page.
let pageleft = self.lsn.remaining_in_block() as u32;
@@ -154,45 +175,42 @@ impl WalStreamDecoder {
self.contlen -= n as u32;
if self.contlen == 0 {
let recordbuf = std::mem::replace(&mut self.recordbuf, BytesMut::new());
let recordbuf = recordbuf.freeze();
let mut buf = recordbuf.clone();
// XLOG_SWITCH records are special. If we see one, we need to skip
// to the next WAL segment.
let xlogrec = XLogRecord::from_bytes(&mut buf);
let mut crc = crc32c_append(0, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);
crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);
if crc != xlogrec.xl_crc {
return Err(WalDecodeError {
msg: "WAL record crc mismatch".into(),
lsn: self.lsn,
});
}
if xlogrec.is_xlog_switch_record() {
trace!("saw xlog switch record at {}", self.lsn);
self.padlen =
self.lsn.calc_padding(pg_constants::WAL_SEGMENT_SIZE as u64) as u32;
} else {
// Pad to an 8-byte boundary
self.padlen = self.lsn.calc_padding(8u32) as u32;
}
// Always align resulting LSN on 0x8 boundary -- that is important for getPage()
// and WalReceiver integration. Since this code is used both for WalReceiver and
// initial WAL import let's force alignment right here.
let result = (self.lsn.align(), recordbuf);
return Ok(Some(result));
// The record is now complete.
recordbuf = std::mem::replace(&mut self.recordbuf, BytesMut::new()).freeze();
break;
}
continue;
}
}
// check record boundaries
// deal with continuation records
// We now have a record in the 'recordbuf' local variable.
let xlogrec = XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]);
// deal with xlog_switch records
let mut crc = 0;
crc = crc32c_append(crc, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);
crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);
if crc != xlogrec.xl_crc {
return Err(WalDecodeError {
msg: "WAL record crc mismatch".into(),
lsn: self.lsn,
});
}
// XLOG_SWITCH records are special. If we see one, we need to skip
// to the next WAL segment.
if xlogrec.is_xlog_switch_record() {
trace!("saw xlog switch record at {}", self.lsn);
self.padlen = self.lsn.calc_padding(pg_constants::WAL_SEGMENT_SIZE as u64) as u32;
} else {
// Pad to an 8-byte boundary
self.padlen = self.lsn.calc_padding(8u32) as u32;
}
// Always align resulting LSN on 0x8 boundary -- that is important for getPage()
// and WalReceiver integration. Since this code is used both for WalReceiver and
// initial WAL import let's force alignment right here.
let result = (self.lsn.align(), recordbuf);
Ok(Some(result))
}
}
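For reference, the CRC check used above follows PostgreSQL's convention: the checksum covers the record payload after the xl_crc field first, then the record header up to xl_crc. The helper below is just a restatement of that computation (the function name and the explicit offset parameter are ours):

    use crc32c::crc32c_append;

    // Restates the CRC computation used above: payload after xl_crc, then the
    // record header up to (but not including) the xl_crc field itself.
    fn wal_record_crc(recordbuf: &[u8], xlog_record_crc_offs: usize) -> u32 {
        let mut crc = crc32c_append(0, &recordbuf[xlog_record_crc_offs + 4..]);
        crc = crc32c_append(crc, &recordbuf[0..xlog_record_crc_offs]);
        crc
    }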
@@ -211,17 +229,18 @@ pub struct DecodedBkpBlock {
pub blkno: u32,
/* copy of the fork_flags field from the XLogRecordBlockHeader */
flags: u8,
pub flags: u8,
/* Information on full-page image, if any */
has_image: bool, /* has image, even for consistency checking */
pub has_image: bool, /* has image, even for consistency checking */
pub apply_image: bool, /* has image that should be restored */
pub will_init: bool, /* record doesn't need previous page version to apply */
//char *bkp_image;
hole_offset: u16,
hole_length: u16,
bimg_len: u16,
bimg_info: u8,
pub hole_offset: u16,
pub hole_length: u16,
pub bimg_offset: u32,
pub bimg_len: u16,
pub bimg_info: u8,
/* Buffer holding the rmgr-specific data associated with this block */
has_data: bool,
@@ -841,8 +860,19 @@ pub fn decode_wal_record(record: Bytes) -> DecodedWALRecord {
}
// 3. Decode blocks.
let mut ptr = record.len() - buf.remaining();
for blk in blocks.iter_mut() {
if blk.has_image {
blk.bimg_offset = ptr as u32;
ptr += blk.bimg_len as usize;
}
if blk.has_data {
ptr += blk.data_len as usize;
}
}
// We don't need them, so just skip blocks_total_len bytes
buf.advance(blocks_total_len as usize);
assert_eq!(ptr, record.len() - buf.remaining());
let main_data_offset = (xlogrec.xl_tot_len - main_data_len) as usize;


@@ -8,28 +8,28 @@
use crate::relish::*;
use crate::restore_local_repo;
use crate::tenant_mgr;
use crate::tenant_mgr::TenantState;
use crate::tenant_threads;
use crate::waldecoder::*;
use crate::PageServerConf;
use anyhow::{Error, Result};
use anyhow::{bail, Error, Result};
use lazy_static::lazy_static;
use log::*;
use postgres::fallible_iterator::FallibleIterator;
use postgres::replication::ReplicationIter;
use postgres::{Client, NoTls, SimpleQueryMessage, SimpleQueryRow};
use postgres_ffi::xlog_utils::*;
use postgres_ffi::*;
use postgres_protocol::message::backend::ReplicationMessage;
use postgres_types::PgLsn;
use std::cmp::{max, min};
use std::cell::Cell;
use std::collections::HashMap;
use std::fs;
use std::fs::{File, OpenOptions};
use std::io::{Seek, SeekFrom, Write};
use std::str::FromStr;
use std::sync::Mutex;
use std::thread;
use std::thread::sleep;
use std::thread::JoinHandle;
use std::thread_local;
use std::time::{Duration, SystemTime};
use tracing::*;
use zenith_utils::lsn::Lsn;
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
@@ -39,6 +39,8 @@ use zenith_utils::zid::ZTimelineId;
//
struct WalReceiverEntry {
wal_producer_connstr: String,
wal_receiver_handle: Option<JoinHandle<()>>,
tenantid: ZTenantId,
}
lazy_static! {
@@ -46,6 +48,43 @@ lazy_static! {
Mutex::new(HashMap::new());
}
thread_local! {
// Boolean that is true only for WAL receiver threads
//
// This is used in `wait_lsn` to guard against usage that might lead to a deadlock.
pub(crate) static IS_WAL_RECEIVER: Cell<bool> = Cell::new(false);
}
// Wait for walreceiver to stop
// Now it stops when pageserver shutdown is requested.
// In the future we can make this more granular and send shutdown signals
// per tenant/timeline to cancel inactive walreceivers.
// TODO deal with blocking pg connections
pub fn stop_wal_receiver(timelineid: ZTimelineId) {
let mut receivers = WAL_RECEIVERS.lock().unwrap();
if let Some(r) = receivers.get_mut(&timelineid) {
r.wal_receiver_handle.take();
// r.wal_receiver_handle.take().map(JoinHandle::join);
}
}
pub fn drop_wal_receiver(timelineid: ZTimelineId, tenantid: ZTenantId) {
let mut receivers = WAL_RECEIVERS.lock().unwrap();
receivers.remove(&timelineid);
// Check if it was the last walreceiver of the tenant.
// TODO now we store one WalReceiverEntry per timeline,
// so this iterator looks a bit strange.
for (_timelineid, entry) in receivers.iter() {
if entry.tenantid == tenantid {
return;
}
}
// When last walreceiver of the tenant is gone, change state to Idle
tenant_mgr::set_tenant_state(tenantid, TenantState::Idle).unwrap();
}
// Launch a new WAL receiver, or tell one that's running about change in connection string
pub fn launch_wal_receiver(
conf: &'static PageServerConf,
@@ -60,18 +99,24 @@ pub fn launch_wal_receiver(
receiver.wal_producer_connstr = wal_producer_connstr.into();
}
None => {
let wal_receiver_handle = thread::Builder::new()
.name("WAL receiver thread".into())
.spawn(move || {
IS_WAL_RECEIVER.with(|c| c.set(true));
thread_main(conf, timelineid, tenantid);
})
.unwrap();
let receiver = WalReceiverEntry {
wal_producer_connstr: wal_producer_connstr.into(),
wal_receiver_handle: Some(wal_receiver_handle),
tenantid,
};
receivers.insert(timelineid, receiver);
// Also launch a new thread to handle this connection
let _walreceiver_thread = thread::Builder::new()
.name("WAL receiver thread".into())
.spawn(move || {
thread_main(conf, timelineid, &tenantid);
})
.unwrap();
// Update tenant state and start tenant threads, if they are not running yet.
tenant_mgr::set_tenant_state(tenantid, TenantState::Active).unwrap();
tenant_threads::start_tenant_threads(conf, tenantid);
}
};
}
@@ -90,17 +135,19 @@ fn get_wal_producer_connstr(timelineid: ZTimelineId) -> String {
//
// This is the entry point for the WAL receiver thread.
//
fn thread_main(conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: &ZTenantId) {
info!(
"WAL receiver thread started for timeline : '{}'",
timelineid
);
fn thread_main(conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId) {
let _enter = info_span!("WAL receiver", timeline = %timelineid, tenant = %tenantid).entered();
info!("WAL receiver thread started");
let mut retry_count = 10;
//
// Make a connection to the WAL safekeeper, or directly to the primary PostgreSQL server,
// and start streaming WAL from it. If the connection is lost, keep retrying.
// TODO How long should we retry in case of losing connection?
// Should we retry at all, or can we wait for the next callmemaybe request?
//
loop {
while !tenant_mgr::shutdown_requested() && retry_count > 0 {
// Look up the current WAL producer address
let wal_producer_connstr = get_wal_producer_connstr(timelineid);
@@ -111,16 +158,27 @@ fn thread_main(conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid:
"WAL streaming connection failed ({}), retrying in 1 second",
e
);
retry_count -= 1;
sleep(Duration::from_secs(1));
} else {
info!(
"walreceiver disconnected tenant {}, timelineid {}",
tenantid, timelineid
);
break;
}
}
info!("WAL streaming shut down");
// Drop it from the list of active WAL_RECEIVERS
// so that the next callmemaybe request launches a new thread
drop_wal_receiver(timelineid, tenantid);
}
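The retry policy above is a fixed count of attempts with a one-second sleep; the TODO asks how long we should keep retrying. One possible refinement, sketched under the assumption that a capped exponential backoff would be acceptable (none of this is in the change itself, and the cap of six doublings is arbitrary):

    use std::time::Duration;

    // Hypothetical backoff schedule: 1s, 2s, 4s, ... capped at `max`.
    fn retry_delay(attempt: u32, max: Duration) -> Duration {
        let delay = Duration::from_secs(1 << attempt.min(6));
        std::cmp::min(delay, max)
    }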
fn walreceiver_main(
conf: &PageServerConf,
_conf: &PageServerConf,
timelineid: ZTimelineId,
wal_producer_connstr: &str,
tenantid: &ZTenantId,
tenantid: ZTenantId,
) -> Result<(), Error> {
// Connect to the database in replication mode.
info!("connecting to {:?}", wal_producer_connstr);
@@ -146,8 +204,7 @@ fn walreceiver_main(
let end_of_wal = Lsn::from(u64::from(identify.xlogpos));
let mut caught_up = false;
let repository = tenant_mgr::get_repository_for_tenant(tenantid)?;
let timeline = repository.get_timeline(timelineid).unwrap();
let timeline = tenant_mgr::get_timeline_for_tenant(tenantid, timelineid)?;
//
// Start streaming the WAL, from where we left off previously.
@@ -158,15 +215,15 @@ fn walreceiver_main(
let mut startpoint = last_rec_lsn;
if startpoint == Lsn(0) {
error!("No previous WAL position");
bail!("No previous WAL position");
}
// There might be some padding after the last full record, skip it.
startpoint += startpoint.calc_padding(8u32);
info!(
"last_record_lsn {} starting replication from {} for timeline {}, server is at {}...",
last_rec_lsn, startpoint, timelineid, end_of_wal
"last_record_lsn {} starting replication from {}, server is at {}...",
last_rec_lsn, startpoint, end_of_wal
);
let query = format!("START_REPLICATION PHYSICAL {}", startpoint);
@@ -176,7 +233,7 @@ fn walreceiver_main(
let mut waldecoder = WalStreamDecoder::new(startpoint);
let checkpoint_bytes = timeline.get_page_at_lsn_nowait(RelishTag::Checkpoint, 0, startpoint)?;
let checkpoint_bytes = timeline.get_page_at_lsn(RelishTag::Checkpoint, 0, startpoint)?;
let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
trace!("CheckPoint.nextXid = {}", checkpoint.nextXid.value);
@@ -188,78 +245,49 @@ fn walreceiver_main(
let data = xlog_data.data();
let startlsn = Lsn::from(xlog_data.wal_start());
let endlsn = startlsn + data.len() as u64;
let prev_last_rec_lsn = last_rec_lsn;
write_wal_file(
conf,
startlsn,
&timelineid,
pg_constants::WAL_SEGMENT_SIZE,
data,
tenantid,
)?;
trace!("received XLogData between {} and {}", startlsn, endlsn);
waldecoder.feed_bytes(data);
while let Some((lsn, recdata)) = waldecoder.poll_decode()? {
// Save old checkpoint value to compare with it after decoding WAL record
let old_checkpoint_bytes = checkpoint.encode();
let decoded = decode_wal_record(recdata.clone());
let _enter = info_span!("processing record", lsn = %lsn).entered();
// It is important to deal with aligned records, because the LSN in getPage@LSN
// is aligned and can be several bytes larger. Without this alignment we are
// at risk of hitting a deadlock.
assert!(lsn.is_aligned());
let writer = timeline.writer();
let mut checkpoint_modified = false;
let decoded = decode_wal_record(recdata.clone());
restore_local_repo::save_decoded_record(
&mut checkpoint,
&*timeline,
&mut checkpoint_modified,
writer.as_ref(),
&decoded,
recdata,
lsn,
)?;
last_rec_lsn = lsn;
let new_checkpoint_bytes = checkpoint.encode();
// Check if checkpoint data was updated by save_decoded_record
if new_checkpoint_bytes != old_checkpoint_bytes {
timeline.put_page_image(
if checkpoint_modified {
let new_checkpoint_bytes = checkpoint.encode();
writer.put_page_image(
RelishTag::Checkpoint,
0,
lsn,
new_checkpoint_bytes,
)?;
}
}
// Somewhat arbitrarily, if we have at least 10 complete wal segments (16 MB each),
// "checkpoint" the repository to flush all the changes from WAL we've processed
// so far to disk. After this, we don't need the original WAL anymore, and it
// can be removed. This is probably too aggressive for production, but it's useful
// to expose bugs now.
//
// TODO: We don't actually dare to remove the WAL. It's useful for debugging,
// and we might need it for logical decoding or other things in the future. Although
// we should also be able to fetch it back from the WAL safekeepers or S3 if
// needed.
if prev_last_rec_lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE)
!= last_rec_lsn.segment_number(pg_constants::WAL_SEGMENT_SIZE)
{
info!("switched segment {} to {}", prev_last_rec_lsn, last_rec_lsn);
let (oldest_segno, newest_segno) = find_wal_file_range(
conf,
&timelineid,
pg_constants::WAL_SEGMENT_SIZE,
last_rec_lsn,
tenantid,
)?;
if newest_segno - oldest_segno >= 10 {
// TODO: This is where we could remove WAL older than last_rec_lsn.
//remove_wal_files(timelineid, pg_constants::WAL_SEGMENT_SIZE, last_rec_lsn)?;
}
// Now that this record has been fully handled, including updating the
// checkpoint data, let the repository know that it is up-to-date to this LSN
writer.advance_last_record_lsn(lsn);
last_rec_lsn = lsn;
}
if !caught_up && endlsn >= end_of_wal {
@@ -283,7 +311,7 @@ fn walreceiver_main(
);
if reply_requested {
Some(timeline.get_last_record_lsn())
Some(last_rec_lsn)
} else {
None
}
@@ -293,67 +321,47 @@ fn walreceiver_main(
};
if let Some(last_lsn) = status_update {
// TODO: More thought should go into what values are sent here.
let last_lsn = PgLsn::from(u64::from(last_lsn));
// The last LSN we processed. It is not guaranteed to survive pageserver crash.
let write_lsn = last_lsn;
let flush_lsn = last_lsn;
let apply_lsn = PgLsn::from(0);
// This value doesn't guarantee data durability, but it's ok.
// In a setup with the WAL service, pageserver durability is guaranteed by the safekeepers.
// In a setup without the WAL service, we just don't care.
let flush_lsn = write_lsn;
// `disk_consistent_lsn` is the LSN at which page server guarantees persistence of all received data
// Depending on the setup, we receive WAL directly from the Compute Node or
// from a WAL service.
//
// Senders use the feedback to determine if we are caught up:
// - Safekeepers are free to remove WAL preceding `apply_lsn`,
// as it will never be requested by this page server.
// - Compute Node uses 'apply_lsn' to calculate a lag for back pressure mechanism
// (delay WAL inserts to avoid lagging pageserver responses and WAL overflow).
let apply_lsn = PgLsn::from(u64::from(timeline.get_disk_consistent_lsn()));
let ts = SystemTime::now();
const NO_REPLY: u8 = 0;
physical_stream.standby_status_update(write_lsn, flush_lsn, apply_lsn, ts, NO_REPLY)?;
}
if tenant_mgr::shutdown_requested() {
debug!("stop walreceiver because pageserver shutdown is requested");
break;
}
}
Ok(())
}
fn find_wal_file_range(
conf: &PageServerConf,
timeline: &ZTimelineId,
wal_seg_size: usize,
written_upto: Lsn,
tenant: &ZTenantId,
) -> Result<(u64, u64)> {
let written_upto_segno = written_upto.segment_number(wal_seg_size);
let mut oldest_segno = written_upto_segno;
let mut newest_segno = written_upto_segno;
// Scan the WAL directory and count how many WAL files we could remove
let wal_dir = conf.wal_dir_path(timeline, tenant);
for entry in fs::read_dir(wal_dir)? {
let entry = entry?;
let path = entry.path();
if path.is_dir() {
continue;
}
let filename = path.file_name().unwrap().to_str().unwrap();
if IsXLogFileName(filename) {
let (segno, _tli) = XLogFromFileName(filename, wal_seg_size);
if segno > written_upto_segno {
// that's strange.
warn!("there is a WAL file from future at {}", path.display());
continue;
}
oldest_segno = min(oldest_segno, segno);
newest_segno = max(newest_segno, segno);
}
}
// FIXME: would be good to assert that there are no gaps in the WAL files
Ok((oldest_segno, newest_segno))
}
/// Data returned from the postgres `IDENTIFY_SYSTEM` command
///
/// See the [postgres docs] for more details.
///
/// [postgres docs]: https://www.postgresql.org/docs/current/protocol-replication.html
#[derive(Debug)]
// As of nightly 2021-09-11, fields that are only read by the type's `Debug` impl still count as
// unused. Relevant issue: https://github.com/rust-lang/rust/issues/88900
#[allow(dead_code)]
pub struct IdentifySystem {
systemid: u64,
timeline: u32,
@@ -394,98 +402,3 @@ pub fn identify_system(client: &mut Client) -> Result<IdentifySystem, Error> {
Err(IdentifyError.into())
}
}
fn write_wal_file(
conf: &PageServerConf,
startpos: Lsn,
timelineid: &ZTimelineId,
wal_seg_size: usize,
buf: &[u8],
tenantid: &ZTenantId,
) -> anyhow::Result<()> {
let mut bytes_left: usize = buf.len();
let mut bytes_written: usize = 0;
let mut partial;
let mut start_pos = startpos;
const ZERO_BLOCK: &[u8] = &[0u8; XLOG_BLCKSZ];
let wal_dir = conf.wal_dir_path(timelineid, tenantid);
/* Extract WAL location for this block */
let mut xlogoff = start_pos.segment_offset(wal_seg_size);
while bytes_left != 0 {
let bytes_to_write;
/*
* If crossing a WAL boundary, only write up until we reach wal
* segment size.
*/
if xlogoff + bytes_left > wal_seg_size {
bytes_to_write = wal_seg_size - xlogoff;
} else {
bytes_to_write = bytes_left;
}
/* Open file */
let segno = start_pos.segment_number(wal_seg_size);
let wal_file_name = XLogFileName(
1, // FIXME: always use Postgres timeline 1
segno,
wal_seg_size,
);
let wal_file_path = wal_dir.join(wal_file_name.clone());
let wal_file_partial_path = wal_dir.join(wal_file_name.clone() + ".partial");
{
let mut wal_file: File;
/* Try to open already completed segment */
if let Ok(file) = OpenOptions::new().write(true).open(&wal_file_path) {
wal_file = file;
partial = false;
} else if let Ok(file) = OpenOptions::new().write(true).open(&wal_file_partial_path) {
/* Try to open existed partial file */
wal_file = file;
partial = true;
} else {
/* Create and fill new partial file */
partial = true;
match OpenOptions::new()
.create(true)
.write(true)
.open(&wal_file_partial_path)
{
Ok(mut file) => {
for _ in 0..(wal_seg_size / XLOG_BLCKSZ) {
file.write_all(ZERO_BLOCK)?;
}
wal_file = file;
}
Err(e) => {
error!("Failed to open log file {:?}: {}", &wal_file_path, e);
return Err(e.into());
}
}
}
wal_file.seek(SeekFrom::Start(xlogoff as u64))?;
wal_file.write_all(&buf[bytes_written..(bytes_written + bytes_to_write)])?;
// FIXME: Flush the file
//wal_file.sync_all()?;
}
/* Write was successful, advance our position */
bytes_written += bytes_to_write;
bytes_left -= bytes_to_write;
start_pos += bytes_to_write as u64;
xlogoff += bytes_to_write;
/* Did we reach the end of a WAL segment? */
if start_pos.segment_offset(wal_seg_size) == 0 {
xlogoff = 0;
if partial {
fs::rename(&wal_file_partial_path, &wal_file_path)?;
}
}
}
Ok(())
}
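The loop above splits the incoming buffer at WAL segment boundaries; the amount written in each iteration reduces to the following helper, which is only a restatement of the logic already shown (the function name is ours):

    // How much of the remaining buffer fits into the current segment.
    fn bytes_to_write(xlogoff: usize, bytes_left: usize, wal_seg_size: usize) -> usize {
        if xlogoff + bytes_left > wal_seg_size {
            wal_seg_size - xlogoff
        } else {
            bytes_left
        }
    }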

File diff suppressed because it is too large


@@ -58,7 +58,7 @@ impl ControlFileData {
let expectedcrc = crc32c::crc32c(&buf[0..OFFSETOF_CRC]);
// Use serde to deserialize the input as a ControlFileData struct.
let controlfile = ControlFileData::des(buf)?;
let controlfile = ControlFileData::des_prefix(buf)?;
// Check the CRC
if expectedcrc != controlfile.crc {


@@ -32,5 +32,5 @@ pub const fn transaction_id_precedes(id1: TransactionId, id2: TransactionId) ->
}
let diff = id1.wrapping_sub(id2) as i32;
return diff < 0;
diff < 0
}
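The wrapping subtraction makes this a circular comparison over the 32-bit XID space. A quick worked example (the values are ours, chosen so both arguments are ordinary XIDs and the wrapping path shown above applies):

    #[test]
    fn xid_wraparound_sketch() {
        // 0xFFFF_FFF0.wrapping_sub(100) interpreted as i32 is negative, so the
        // numerically larger "old" XID still precedes the wrapped-around "new" one.
        assert!(transaction_id_precedes(0xFFFF_FFF0, 100));
        assert!(!transaction_id_precedes(100, 0xFFFF_FFF0));
    }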


@@ -46,7 +46,6 @@ pub const SIZE_OF_PAGE_HEADER: u16 = 24;
pub const BITS_PER_HEAPBLOCK: u16 = 2;
pub const HEAPBLOCKS_PER_PAGE: u16 = (BLCKSZ - SIZE_OF_PAGE_HEADER) * 8 / BITS_PER_HEAPBLOCK;
pub const TRANSACTION_STATUS_IN_PROGRESS: u8 = 0x00;
pub const TRANSACTION_STATUS_COMMITTED: u8 = 0x01;
pub const TRANSACTION_STATUS_ABORTED: u8 = 0x02;
pub const TRANSACTION_STATUS_SUB_COMMITTED: u8 = 0x03;
@@ -192,31 +191,6 @@ pub const XLP_LONG_HEADER: u16 = 0x0002;
pub const PG_MAJORVERSION: &str = "14";
// Zenith specific page flags used to distinguish heap and non-heap relations
pub const PD_HEAP_RELATION: u16 = 0x10;
pub const PD_NONHEAP_RELATION: u16 = 0x20;
// bufpage.h
pub const PD_FLAGS_OFFSET: usize = 10; // PageHeaderData.pd_flags
pub const PD_LOWER_OFFSET: usize = 12; // PageHeaderData.pd_lower
// itemid.h
pub const LP_NORMAL: u32 = 1;
// htup_details.h
pub const T_XMIN_OFFS: usize = 0;
pub const T_XMAX_OFFS: usize = 4;
pub const T_INFOMASK_OFFS: usize = 4 * 3 + 2 * 3 + 2;
pub const HEAP_XMIN_COMMITTED: u16 = 0x0100; /* t_xmin committed */
pub const HEAP_XMIN_INVALID: u16 = 0x0200; /* t_xmin invalid/aborted */
pub const HEAP_XMAX_COMMITTED: u16 = 0x0400; /* t_xmax committed */
pub const HEAP_XMAX_INVALID: u16 = 0x0800; /* t_xmax invalid/aborted */
pub const HEAP_XMAX_IS_MULTI: u16 = 0x1000; /* t_xmax is a MultiXactId */
pub const SIZE_OF_PAGE_HEADER_DATA: usize = 24;
// xlogrecord.h
pub const XL_RMID_OFFS: usize = 17;
// List of subdirectories inside pgdata.
// Copied from src/bin/initdb/initdb.c
pub const PGDATA_SUBDIRS: [&str; 22] = [


@@ -9,21 +9,23 @@
use crate::pg_constants;
use crate::CheckPoint;
use crate::ControlFileData;
use crate::FullTransactionId;
use crate::XLogLongPageHeaderData;
use crate::XLogPageHeaderData;
use crate::XLogRecord;
use crate::XLOG_PAGE_MAGIC;
use anyhow::{bail, Result};
use byteorder::{ByteOrder, LittleEndian};
use bytes::BytesMut;
use bytes::{Buf, Bytes};
use bytes::{BufMut, BytesMut};
use crc32c::*;
use log::*;
use std::cmp::max;
use std::cmp::min;
use std::fs::{self, File};
use std::io::prelude::*;
use std::io::SeekFrom;
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use zenith_utils::lsn::Lsn;
@@ -41,6 +43,9 @@ pub const XLOG_SIZE_OF_XLOG_RECORD: usize = std::mem::size_of::<XLogRecord>();
#[allow(clippy::identity_op)]
pub const SIZE_OF_XLOG_RECORD_DATA_HEADER_SHORT: usize = 1 * 2;
// PG timeline is always 1, changing it doesn't have useful meaning in Zenith.
pub const PG_TLI: u32 = 1;
pub type XLogRecPtr = u64;
pub type TimeLineID = u32;
pub type TimestampTz = i64;
@@ -125,8 +130,10 @@ fn find_end_of_wal_segment(
segno: XLogSegNo,
tli: TimeLineID,
wal_seg_size: usize,
) -> u32 {
let mut offs: usize = 0;
start_offset: usize, // start reading at this point
) -> Result<u32> {
// step back to the beginning of the page to read it in...
let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ;
let mut contlen: usize = 0;
let mut wal_crc: u32 = 0;
let mut crc: u32 = 0;
@@ -135,24 +142,33 @@ fn find_end_of_wal_segment(
let file_name = XLogFileName(tli, segno, wal_seg_size);
let mut last_valid_rec_pos: usize = 0;
let mut file = File::open(data_dir.join(file_name.clone() + ".partial")).unwrap();
file.seek(SeekFrom::Start(offs as u64))?;
let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS];
while offs < wal_seg_size {
// we are at the beginning of the page; read it in
if offs % XLOG_BLCKSZ == 0 {
if let Ok(bytes_read) = file.read(&mut buf) {
if bytes_read != buf.len() {
break;
}
} else {
break;
let bytes_read = file.read(&mut buf)?;
if bytes_read != buf.len() {
bail!(
"failed to read {} bytes from {} at {}",
XLOG_BLCKSZ,
file_name,
offs
);
}
let xlp_magic = LittleEndian::read_u16(&buf[0..2]);
let xlp_info = LittleEndian::read_u16(&buf[2..4]);
let xlp_rem_len = LittleEndian::read_u32(&buf[XLP_REM_LEN_OFFS..XLP_REM_LEN_OFFS + 4]);
// this is expected in current usage when valid WAL starts after page header
if xlp_magic != XLOG_PAGE_MAGIC as u16 {
info!("Invalid WAL file {}.partial magic {}", file_name, xlp_magic);
break;
trace!(
"invalid WAL file {}.partial magic {} at {:?}",
file_name,
xlp_magic,
Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)),
);
}
if offs == 0 {
offs = XLOG_SIZE_OF_XLOG_LONG_PHD;
@@ -162,11 +178,23 @@ fn find_end_of_wal_segment(
} else {
offs += XLOG_SIZE_OF_XLOG_SHORT_PHD;
}
// ... and step forward again if asked
offs = max(offs, start_offset);
// beginning of the next record
} else if contlen == 0 {
let page_offs = offs % XLOG_BLCKSZ;
let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize;
if xl_tot_len == 0 {
info!(
"find_end_of_wal_segment reached zeros at {:?}, last records ends at {:?}",
Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)),
Lsn(XLogSegNoOffsetToRecPtr(
segno,
last_valid_rec_pos as u32,
wal_seg_size
))
);
break; // zeros, reached the end
}
last_valid_rec_pos = offs;
@@ -217,7 +245,7 @@ fn find_end_of_wal_segment(
}
}
}
last_valid_rec_pos as u32
Ok(last_valid_rec_pos as u32)
}
///
@@ -230,7 +258,8 @@ pub fn find_end_of_wal(
data_dir: &Path,
wal_seg_size: usize,
precise: bool,
) -> (XLogRecPtr, TimeLineID) {
start_lsn: Lsn, // start reading WAL at this point or later
) -> Result<(XLogRecPtr, TimeLineID)> {
let mut high_segno: XLogSegNo = 0;
let mut high_tli: TimeLineID = 0;
let mut high_ispartial = false;
@@ -272,19 +301,37 @@ pub fn find_end_of_wal(
high_segno += 1;
} else if precise {
/* otherwise locate last record in last partial segment */
high_offs = find_end_of_wal_segment(data_dir, high_segno, high_tli, wal_seg_size);
if start_lsn.segment_number(wal_seg_size) > high_segno {
bail!(
"provided start_lsn {:?} is beyond highest segno {:?} available",
start_lsn,
high_segno,
);
}
let start_offset = if start_lsn.segment_number(wal_seg_size) == high_segno {
start_lsn.segment_offset(wal_seg_size)
} else {
0
};
high_offs = find_end_of_wal_segment(
data_dir,
high_segno,
high_tli,
wal_seg_size,
start_offset,
)?;
}
let high_ptr = XLogSegNoOffsetToRecPtr(high_segno, high_offs, wal_seg_size);
return (high_ptr, high_tli);
return Ok((high_ptr, high_tli));
}
(0, 0)
Ok((0, 0))
}
pub fn main() {
let mut data_dir = PathBuf::new();
data_dir.push(".");
let wal_seg_size = 16 * 1024 * 1024;
let (wal_end, tli) = find_end_of_wal(&data_dir, wal_seg_size, true);
let (wal_end, tli) = find_end_of_wal(&data_dir, wal_seg_size, true, Lsn(0)).unwrap();
println!(
"wal_end={:>08X}{:>08X}, tli={}",
(wal_end >> 32) as u32,
@@ -294,7 +341,12 @@ pub fn main() {
}
impl XLogRecord {
pub fn from_bytes(buf: &mut Bytes) -> XLogRecord {
pub fn from_slice(buf: &[u8]) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
XLogRecord::des(buf).unwrap()
}
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
XLogRecord::des_from(&mut buf.reader()).unwrap()
}
@@ -342,10 +394,12 @@ impl CheckPoint {
Ok(CheckPoint::des(buf)?)
}
// Update next XID based on provided new_xid and stored epoch.
// Next XID should be greater than new_xid.
// Also take in account 32-bit wrap-around.
pub fn update_next_xid(&mut self, xid: u32) {
/// Update next XID based on provided new_xid and stored epoch.
/// Next XID should be greater than new_xid. This handles 32-bit
/// XID wraparound correctly.
///
/// Returns 'true' if the XID was updated.
pub fn update_next_xid(&mut self, xid: u32) -> bool {
let xid = xid.wrapping_add(XID_CHECKPOINT_INTERVAL - 1) & !(XID_CHECKPOINT_INTERVAL - 1);
let full_xid = self.nextXid.value;
let new_xid = std::cmp::max(xid + 1, pg_constants::FIRST_NORMAL_TRANSACTION_ID);
@@ -356,35 +410,37 @@ impl CheckPoint {
// wrap-around
epoch += 1;
}
self.nextXid = FullTransactionId {
value: (epoch << 32) | new_xid as u64,
};
let nextXid = (epoch << 32) | new_xid as u64;
if nextXid != self.nextXid.value {
self.nextXid = FullTransactionId { value: nextXid };
return true;
}
}
false
}
}
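As a worked example of the rounding above, assuming a hypothetical XID_CHECKPOINT_INTERVAL of 1024 (the real constant may differ): an incoming xid of 1 is first rounded up to the interval boundary 1024 and then bumped to new_xid = 1025, so nextXid becomes (epoch << 32) | 1025 and the function returns true. Calling it again with the same xid recomputes the same value, so nextXid is left untouched and the function returns false.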
//
// Generate new WAL segment with single XLOG_CHECKPOINT_SHUTDOWN record.
// Generate new, empty WAL segment.
// We need this segment to start a compute node.
// In order to minimize changes in Postgres core, we prefer to
// provide a WAL segment from which it can extract the checkpoint record in the standard way,
// rather than implement some alternative mechanism.
//
pub fn generate_wal_segment(pg_control: &ControlFileData) -> Bytes {
pub fn generate_wal_segment(segno: u64, system_id: u64) -> Bytes {
let mut seg_buf = BytesMut::with_capacity(pg_constants::WAL_SEGMENT_SIZE as usize);
let pageaddr = XLogSegNoOffsetToRecPtr(segno, 0, pg_constants::WAL_SEGMENT_SIZE);
let hdr = XLogLongPageHeaderData {
std: {
XLogPageHeaderData {
xlp_magic: XLOG_PAGE_MAGIC as u16,
xlp_info: pg_constants::XLP_LONG_HEADER,
xlp_tli: 1, // FIXME: always use Postgres timeline 1
xlp_pageaddr: pg_control.checkPoint - XLOG_SIZE_OF_XLOG_LONG_PHD as u64,
xlp_tli: PG_TLI,
xlp_pageaddr: pageaddr,
xlp_rem_len: 0,
..Default::default() // Put 0 in padding fields.
}
},
xlp_sysid: pg_control.system_identifier,
xlp_sysid: system_id,
xlp_seg_size: pg_constants::WAL_SEGMENT_SIZE as u32,
xlp_xlog_blcksz: XLOG_BLCKSZ as u32,
};
@@ -392,36 +448,6 @@ pub fn generate_wal_segment(pg_control: &ControlFileData) -> Bytes {
let hdr_bytes = hdr.encode();
seg_buf.extend_from_slice(&hdr_bytes);
let rec_hdr = XLogRecord {
xl_tot_len: (XLOG_SIZE_OF_XLOG_RECORD
+ SIZE_OF_XLOG_RECORD_DATA_HEADER_SHORT
+ SIZEOF_CHECKPOINT) as u32,
xl_xid: 0, //0 is for InvalidTransactionId
xl_prev: 0,
xl_info: pg_constants::XLOG_CHECKPOINT_SHUTDOWN,
xl_rmid: pg_constants::RM_XLOG_ID,
xl_crc: 0,
..Default::default() // Put 0 in padding fields.
};
let mut rec_shord_hdr_bytes = BytesMut::new();
rec_shord_hdr_bytes.put_u8(pg_constants::XLR_BLOCK_ID_DATA_SHORT);
rec_shord_hdr_bytes.put_u8(SIZEOF_CHECKPOINT as u8);
let rec_bytes = rec_hdr.encode();
let checkpoint_bytes = pg_control.checkPointCopy.encode();
//calculate record checksum
let mut crc = 0;
crc = crc32c_append(crc, &rec_shord_hdr_bytes[..]);
crc = crc32c_append(crc, &checkpoint_bytes[..]);
crc = crc32c_append(crc, &rec_bytes[0..XLOG_RECORD_CRC_OFFS]);
seg_buf.extend_from_slice(&rec_bytes[0..XLOG_RECORD_CRC_OFFS]);
seg_buf.put_u32_le(crc);
seg_buf.extend_from_slice(&rec_shord_hdr_bytes);
seg_buf.extend_from_slice(&checkpoint_bytes);
//zero out the rest of the file
seg_buf.resize(pg_constants::WAL_SEGMENT_SIZE, 0);
seg_buf.freeze()
@@ -463,7 +489,7 @@ mod tests {
let wal_seg_size = 16 * 1024 * 1024;
// 3. Check end_of_wal on non-partial WAL segment (we treat it as fully populated)
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true);
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap();
let wal_end = Lsn(wal_end);
println!("wal_end={}, tli={}", wal_end, tli);
assert_eq!(wal_end, "0/2000000".parse::<Lsn>().unwrap());
@@ -489,7 +515,7 @@ mod tests {
wal_dir.join("000000010000000000000001.partial"),
)
.unwrap();
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true);
let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap();
let wal_end = Lsn(wal_end);
println!("wal_end={}, tli={}", wal_end, tli);
assert_eq!(wal_end, waldump_wal_end);


@@ -37,20 +37,27 @@ def rustfmt(fix_inplace: bool = False, no_color: bool = False) -> str:
return cmd
def yapf(fix_inplace: bool) -> str:
cmd = "pipenv run yapf --recursive"
if fix_inplace:
cmd += " --in-place"
else:
cmd += " --diff"
return cmd
def mypy() -> str:
return "pipenv run mypy"
def get_commit_files() -> List[str]:
files = subprocess.check_output(
"git diff --cached --name-only --diff-filter=ACM".split()
)
files = subprocess.check_output("git diff --cached --name-only --diff-filter=ACM".split())
return files.decode().splitlines()
def check(
name: str, suffix: str, cmd: str, changed_files: List[str], no_color: bool = False
):
def check(name: str, suffix: str, cmd: str, changed_files: List[str], no_color: bool = False):
print(f"Checking: {name} ", end="")
applicable_files = list(
filter(lambda fname: fname.strip().endswith(suffix), changed_files)
)
applicable_files = list(filter(lambda fname: fname.strip().endswith(suffix), changed_files))
if not applicable_files:
print(colorify("[NOT APPLICABLE]", Color.CYAN, no_color))
return
@@ -59,7 +66,14 @@ def check(
res = subprocess.run(cmd.split(), capture_output=True)
if res.returncode != 0:
print(colorify("[FAILED]", Color.RED, no_color))
print("Please inspect the output below and run make fmt to fix automatically\n")
if name == "mypy":
print("Please inspect the output below and fix type mismatches.")
else:
print("Please inspect the output below and run make fmt to fix automatically.")
if suffix == ".py":
print("If the output is empty, ensure that you've installed Python tooling by\n"
"running 'pipenv install --dev' in the current directory (no root needed)")
print()
print(res.stdout.decode())
exit(1)
@@ -68,12 +82,11 @@ def check(
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--fix-inplace", action="store_true", help="apply fixes inplace"
)
parser.add_argument(
"--no-color", action="store_true", help="disable colored output", default=not sys.stdout.isatty()
)
parser.add_argument("--fix-inplace", action="store_true", help="apply fixes inplace")
parser.add_argument("--no-color",
action="store_true",
help="disable colored output",
default=not sys.stdout.isatty())
args = parser.parse_args()
files = get_commit_files()
@@ -87,3 +100,17 @@ if __name__ == "__main__":
changed_files=files,
no_color=args.no_color,
)
check(
name="yapf",
suffix=".py",
cmd=yapf(fix_inplace=args.fix_inplace),
changed_files=files,
no_color=args.no_color,
)
check(
name="mypy",
suffix=".py",
cmd=mypy(),
changed_files=files,
no_color=args.no_color,
)


@@ -14,9 +14,10 @@ rand = "0.8.3"
hex = "0.4.3"
serde = "1"
serde_json = "1"
tokio = { version = "1.7.1", features = ["full"] }
tokio-postgres = "0.7.2"
tokio = "1.11"
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" }
clap = "2.33.0"
rustls = "0.19.1"
reqwest = { version = "0.11", features = ["blocking", "json"] }
zenith_utils = { path = "../zenith_utils" }


@@ -1,92 +1,139 @@
use anyhow::{bail, Result};
use anyhow::{anyhow, bail, Context};
use serde::{Deserialize, Serialize};
use std::{
collections::HashMap,
net::{IpAddr, SocketAddr},
};
use std::net::{SocketAddr, ToSocketAddrs};
pub struct CPlaneApi {
// address: SocketAddr,
}
use crate::state::ProxyWaiters;
#[derive(Serialize, Deserialize)]
#[derive(Serialize, Deserialize, Debug, Default)]
pub struct DatabaseInfo {
pub host: IpAddr, // TODO: allow host name here too
pub host: String,
pub port: u16,
pub dbname: String,
pub user: String,
pub password: String,
pub password: Option<String>,
}
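// The control plane reply is deserialized untagged: serde tries the variants in declaration
// order, so a payload carrying `conn_info` matches Ready before the catch-all NotReady shape.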
#[derive(Serialize, Deserialize, Debug)]
#[serde(untagged)]
enum ProxyAuthResponse {
Ready { conn_info: DatabaseInfo },
Error { error: String },
NotReady { ready: bool }, // TODO: get rid of `ready`
}
impl DatabaseInfo {
pub fn socket_addr(&self) -> SocketAddr {
SocketAddr::new(self.host, self.port)
}
pub fn conn_string(&self) -> String {
format!(
"dbname={} user={} password={}",
self.dbname, self.user, self.password
)
pub fn socket_addr(&self) -> anyhow::Result<SocketAddr> {
let host_port = format!("{}:{}", self.host, self.port);
host_port
.to_socket_addrs()
.with_context(|| format!("cannot resolve {} to SocketAddr", host_port))?
.next()
.ok_or_else(|| anyhow!("cannot resolve at least one SocketAddr"))
}
}
// mock cplane api
impl CPlaneApi {
pub fn new(_address: &SocketAddr) -> CPlaneApi {
CPlaneApi {
// address: address.clone(),
impl From<DatabaseInfo> for tokio_postgres::Config {
fn from(db_info: DatabaseInfo) -> Self {
let mut config = tokio_postgres::Config::new();
config
.host(&db_info.host)
.port(db_info.port)
.dbname(&db_info.dbname)
.user(&db_info.user);
if let Some(password) = db_info.password {
config.password(password);
}
config
}
}
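A short, made-up illustration of the two conversions introduced above: host names are now resolved lazily via socket_addr(), and DatabaseInfo converts straight into a tokio_postgres::Config:
fn conversions_sketch() -> anyhow::Result<()> {
    let db_info = DatabaseInfo {
        host: "localhost".into(), // a host name is now accepted, not just an IP address
        port: 5432,
        dbname: "postgres".into(),
        user: "stas".into(),
        password: None, // the password became optional
    };
    let addr = db_info.socket_addr()?; // DNS resolution happens here, not at deserialization
    println!("would connect to {}", addr);
    let _config: tokio_postgres::Config = db_info.into();
    Ok(())
}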
pub struct CPlaneApi<'a> {
auth_endpoint: &'a str,
waiters: &'a ProxyWaiters,
}
impl<'a> CPlaneApi<'a> {
pub fn new(auth_endpoint: &'a str, waiters: &'a ProxyWaiters) -> Self {
Self {
auth_endpoint,
waiters,
}
}
}
impl CPlaneApi<'_> {
pub fn authenticate_proxy_request(
&self,
user: &str,
database: &str,
md5_response: &[u8],
salt: &[u8; 4],
psql_session_id: &str,
) -> anyhow::Result<DatabaseInfo> {
let mut url = reqwest::Url::parse(self.auth_endpoint)?;
url.query_pairs_mut()
.append_pair("login", user)
.append_pair("database", database)
.append_pair("md5response", std::str::from_utf8(md5_response)?)
.append_pair("salt", &hex::encode(salt))
.append_pair("psql_session_id", psql_session_id);
let waiter = self.waiters.register(psql_session_id.to_owned());
println!("cplane request: {}", url);
let resp = reqwest::blocking::get(url)?;
if !resp.status().is_success() {
bail!("Auth failed: {}", resp.status())
}
let auth_info: ProxyAuthResponse = serde_json::from_str(resp.text()?.as_str())?;
println!("got auth info: #{:?}", auth_info);
use ProxyAuthResponse::*;
match auth_info {
Ready { conn_info } => Ok(conn_info),
Error { error } => bail!(error),
NotReady { .. } => waiter.wait()?.map_err(|e| anyhow!(e)),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
#[test]
fn test_proxy_auth_response() {
// Ready
let auth: ProxyAuthResponse = serde_json::from_value(json!({
"ready": true,
"conn_info": DatabaseInfo::default(),
}))
.unwrap();
assert!(matches!(
auth,
ProxyAuthResponse::Ready {
conn_info: DatabaseInfo { .. }
}
));
// Error
let auth: ProxyAuthResponse = serde_json::from_value(json!({
"ready": false,
"error": "too bad, so sad",
}))
.unwrap();
assert!(matches!(auth, ProxyAuthResponse::Error { .. }));
// NotReady
let auth: ProxyAuthResponse = serde_json::from_value(json!({
"ready": false,
}))
.unwrap();
assert!(matches!(auth, ProxyAuthResponse::NotReady { .. }));
}
pub fn check_auth(&self, user: &str, md5_response: &[u8], salt: &[u8; 4]) -> Result<()> {
// passwords for both is "mypass"
let auth_map: HashMap<_, &str> = vec![
("stas@zenith", "716ee6e1c4a9364d66285452c47402b1"),
("stas2@zenith", "3996f75df64c16a8bfaf01301b61d582"),
]
.into_iter()
.collect();
let stored_hash = auth_map
.get(&user)
.ok_or_else(|| anyhow::Error::msg("user not found"))?;
let salted_stored_hash = format!(
"md5{:x}",
md5::compute([stored_hash.as_bytes(), salt].concat())
);
let received_hash = std::str::from_utf8(md5_response)?;
println!(
"auth: {} rh={} sh={} ssh={} {:?}",
user, received_hash, stored_hash, salted_stored_hash, salt
);
if received_hash == salted_stored_hash {
Ok(())
} else {
bail!("Auth failed")
}
}
pub fn get_database_uri(&self, _user: &str, _database: &str) -> Result<DatabaseInfo> {
Ok(DatabaseInfo {
host: "127.0.0.1".parse()?,
port: 5432,
dbname: "stas".to_string(),
user: "stas".to_string(),
password: "mypass".to_string(),
})
}
// pub fn create_database(&self, _user: &String, _database: &String) -> Result<DatabaseInfo> {
// Ok(DatabaseInfo {
// host: "127.0.0.1".parse()?,
// port: 5432,
// dbname: "stas".to_string(),
// user: "stas".to_string(),
// password: "mypass".to_string(),
// })
// }
}


@@ -5,79 +5,21 @@
/// (control plane API in our case) and can create new databases and accounts
/// in somewhat transparent manner (again via communication with control plane API).
///
use std::{
collections::HashMap,
net::{SocketAddr, TcpListener},
sync::{mpsc, Arc, Mutex},
thread,
};
use anyhow::{anyhow, bail, ensure, Context};
use clap::{App, Arg, ArgMatches};
use cplane_api::DatabaseInfo;
use rustls::{internal::pemfile, NoClientAuth, ProtocolVersion, ServerConfig};
use anyhow::bail;
use clap::{App, Arg};
use state::{ProxyConfig, ProxyState};
use std::thread;
use zenith_utils::{tcp_listener, GIT_VERSION};
mod cplane_api;
mod mgmt;
mod proxy;
pub struct ProxyConf {
/// main entrypoint for users to connect to
pub proxy_address: SocketAddr,
/// http management endpoint. Upon user account creation control plane
/// will notify us here, so that we can 'unfreeze' user session.
pub mgmt_address: SocketAddr,
/// send unauthenticated users to this URI
pub redirect_uri: String,
/// control plane address where we would check auth.
pub cplane_address: SocketAddr,
pub ssl_config: Option<Arc<ServerConfig>>,
}
pub struct ProxyState {
pub conf: ProxyConf,
pub waiters: Mutex<HashMap<String, mpsc::Sender<anyhow::Result<DatabaseInfo>>>>,
}
fn configure_ssl(arg_matches: &ArgMatches) -> anyhow::Result<Option<Arc<ServerConfig>>> {
let (key_path, cert_path) = match (
arg_matches.value_of("ssl-key"),
arg_matches.value_of("ssl-cert"),
) {
(Some(key_path), Some(cert_path)) => (key_path, cert_path),
(None, None) => return Ok(None),
_ => bail!("either both or neither ssl-key and ssl-cert must be specified"),
};
let key = {
let key_bytes = std::fs::read(key_path).context("SSL key file")?;
let mut keys = pemfile::rsa_private_keys(&mut &key_bytes[..])
.or_else(|_| pemfile::pkcs8_private_keys(&mut &key_bytes[..]))
.map_err(|_| anyhow!("couldn't read TLS keys"))?;
ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len());
keys.pop().unwrap()
};
let cert_chain = {
let cert_chain_bytes = std::fs::read(cert_path).context("SSL cert file")?;
pemfile::certs(&mut &cert_chain_bytes[..])
.map_err(|_| anyhow!("couldn't read TLS certificates"))?
};
let mut config = ServerConfig::new(NoClientAuth::new());
config.set_single_cert(cert_chain, key)?;
config.versions = vec![ProtocolVersion::TLSv1_3];
Ok(Some(Arc::new(config)))
}
mod state;
mod waiters;
fn main() -> anyhow::Result<()> {
let arg_matches = App::new("Zenith proxy/router")
.version(GIT_VERSION)
.arg(
Arg::with_name("proxy")
.short("p")
@@ -102,6 +44,14 @@ fn main() -> anyhow::Result<()> {
.help("redirect unauthenticated users to given uri")
.default_value("http://localhost:3000/psql_session/"),
)
.arg(
Arg::with_name("auth-endpoint")
.short("a")
.long("auth-endpoint")
.takes_value(true)
.help("redirect unauthenticated users to given uri")
.default_value("http://localhost:3000/authenticate_proxy_request/"),
)
.arg(
Arg::with_name("ssl-key")
.short("k")
@@ -118,38 +68,47 @@ fn main() -> anyhow::Result<()> {
)
.get_matches();
let conf = ProxyConf {
let ssl_config = match (
arg_matches.value_of("ssl-key"),
arg_matches.value_of("ssl-cert"),
) {
(Some(key_path), Some(cert_path)) => {
Some(crate::state::configure_ssl(key_path, cert_path)?)
}
(None, None) => None,
_ => bail!("either both or neither ssl-key and ssl-cert must be specified"),
};
let config = ProxyConfig {
proxy_address: arg_matches.value_of("proxy").unwrap().parse()?,
mgmt_address: arg_matches.value_of("mgmt").unwrap().parse()?,
redirect_uri: arg_matches.value_of("uri").unwrap().parse()?,
cplane_address: "127.0.0.1:3000".parse()?,
ssl_config: configure_ssl(&arg_matches)?,
auth_endpoint: arg_matches.value_of("auth-endpoint").unwrap().parse()?,
ssl_config,
};
let state = ProxyState {
conf,
waiters: Mutex::new(HashMap::new()),
};
let state: &'static ProxyState = Box::leak(Box::new(state));
let state: &ProxyState = Box::leak(Box::new(ProxyState::new(config)));
println!("Version: {}", GIT_VERSION);
// Check that we can bind to address before further initialization
println!("Starting proxy on {}", state.conf.proxy_address);
let pageserver_listener = TcpListener::bind(state.conf.proxy_address)?;
let pageserver_listener = tcp_listener::bind(state.conf.proxy_address)?;
println!("Starting mgmt on {}", state.conf.mgmt_address);
let mgmt_listener = TcpListener::bind(state.conf.mgmt_address)?;
let mgmt_listener = tcp_listener::bind(state.conf.mgmt_address)?;
let threads = vec![
let threads = [
// Spawn a thread to listen for connections. It will spawn further threads
// for each connection.
thread::Builder::new()
.name("Proxy thread".into())
.name("Listener thread".into())
.spawn(move || proxy::thread_main(state, pageserver_listener))?,
thread::Builder::new()
.name("Mgmt thread".into())
.spawn(move || mgmt::thread_main(state, mgmt_listener))?,
];
for t in threads.into_iter() {
for t in threads {
t.join().unwrap()?;
}


@@ -3,9 +3,8 @@ use std::{
thread,
};
use anyhow::bail;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use serde::Deserialize;
use zenith_utils::{
postgres_backend::{self, query_from_cstring, AuthType, PostgresBackend},
pq_proto::{BeMessage, SINGLE_COL_ROWDESC},
@@ -25,22 +24,23 @@ pub fn thread_main(state: &'static ProxyState, listener: TcpListener) -> anyhow:
socket.set_nodelay(true).unwrap();
thread::spawn(move || {
if let Err(err) = mgmt_conn_main(state, socket) {
if let Err(err) = handle_connection(state, socket) {
println!("error: {}", err);
}
});
}
}
pub fn mgmt_conn_main(state: &'static ProxyState, socket: TcpStream) -> anyhow::Result<()> {
fn handle_connection(state: &ProxyState, socket: TcpStream) -> anyhow::Result<()> {
let mut conn_handler = MgmtHandler { state };
let pgbackend = PostgresBackend::new(socket, AuthType::Trust, None)?;
let pgbackend = PostgresBackend::new(socket, AuthType::Trust, None, true)?;
pgbackend.run(&mut conn_handler)
}
struct MgmtHandler {
state: &'static ProxyState,
struct MgmtHandler<'a> {
state: &'a ProxyState,
}
/// Serialized examples:
// {
// "session_id": "71d6d03e6d93d99a",
@@ -49,7 +49,7 @@ struct MgmtHandler {
// "host": "127.0.0.1",
// "port": 5432,
// "dbname": "stas",
// "user": "stas"
// "user": "stas",
// "password": "mypass"
// }
// }
@@ -60,52 +60,62 @@ struct MgmtHandler {
// "Failure": "oops"
// }
// }
#[derive(Serialize, Deserialize)]
pub struct PsqlSessionResponse {
//
// // to test manually by sending a query to mgmt interface:
// psql -h 127.0.0.1 -p 9999 -c '{"session_id":"4f10dde522e14739","result":{"Success":{"host":"127.0.0.1","port":5432,"dbname":"stas","user":"stas","password":"stas"}}}'
#[derive(Deserialize)]
struct PsqlSessionResponse {
session_id: String,
result: PsqlSessionResult,
}
#[derive(Serialize, Deserialize)]
pub enum PsqlSessionResult {
#[derive(Deserialize)]
enum PsqlSessionResult {
Success(DatabaseInfo),
Failure(String),
}
impl postgres_backend::Handler for MgmtHandler {
impl postgres_backend::Handler for MgmtHandler<'_> {
fn process_query(
&mut self,
pgb: &mut PostgresBackend,
query_string: Bytes,
) -> anyhow::Result<()> {
let query_string = query_from_cstring(query_string);
println!("Got mgmt query: '{}'", std::str::from_utf8(&query_string)?);
let resp: PsqlSessionResponse = serde_json::from_slice(&query_string)?;
let waiters = self.state.waiters.lock().unwrap();
let sender = waiters
.get(&resp.session_id)
.ok_or_else(|| anyhow::Error::msg("psql_session_id is not found"))?;
match resp.result {
PsqlSessionResult::Success(db_info) => {
sender.send(Ok(db_info))?;
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
pgb.flush()?;
Ok(())
}
PsqlSessionResult::Failure(message) => {
sender.send(Err(anyhow::Error::msg(message.clone())))?;
bail!("psql session request failed: {}", message)
}
let res = try_process_query(self, pgb, query_string);
// intercept and log error message
if res.is_err() {
println!("Mgmt query failed: #{:?}", res);
}
res
}
}
fn try_process_query(
mgmt: &mut MgmtHandler,
pgb: &mut PostgresBackend,
query_string: Bytes,
) -> anyhow::Result<()> {
let query_string = query_from_cstring(query_string);
println!("Got mgmt query: '{}'", std::str::from_utf8(&query_string)?);
let resp: PsqlSessionResponse = serde_json::from_slice(&query_string)?;
use PsqlSessionResult::*;
let msg = match resp.result {
Success(db_info) => Ok(db_info),
Failure(message) => Err(message),
};
match mgmt.state.waiters.notify(&resp.session_id, msg) {
Ok(()) => {
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
}
Err(e) => {
pgb.write_message(&BeMessage::ErrorResponse(e.to_string()))?;
}
}
Ok(())
}


@@ -1,18 +1,12 @@
use crate::cplane_api::CPlaneApi;
use crate::cplane_api::DatabaseInfo;
use crate::cplane_api::{CPlaneApi, DatabaseInfo};
use crate::ProxyState;
use anyhow::bail;
use anyhow::{anyhow, bail};
use std::net::TcpStream;
use std::{io, thread};
use tokio_postgres::NoTls;
use rand::Rng;
use std::io::Write;
use std::{io, sync::mpsc::channel, thread};
use zenith_utils::postgres_backend::Stream;
use zenith_utils::postgres_backend::{PostgresBackend, ProtoState};
use zenith_utils::pq_proto::*;
use zenith_utils::postgres_backend::{self, PostgresBackend, ProtoState, Stream};
use zenith_utils::pq_proto::{BeMessage as Be, FeMessage as Fe, *};
use zenith_utils::sock_split::{ReadStream, WriteStream};
use zenith_utils::{postgres_backend, pq_proto::BeMessage};
///
/// Main proxy listener loop.
@@ -28,264 +22,259 @@ pub fn thread_main(
println!("accepted connection from {}", peer_addr);
socket.set_nodelay(true).unwrap();
thread::spawn(move || {
if let Err(err) = proxy_conn_main(state, socket) {
println!("error: {}", err);
}
});
thread::Builder::new()
.name("Proxy thread".into())
.spawn(move || {
if let Err(err) = proxy_conn_main(state, socket) {
println!("error: {}", err);
}
})?;
}
}
// XXX: clean up fields
// TODO: clean up fields
struct ProxyConnection {
state: &'static ProxyState,
cplane: CPlaneApi,
user: String,
database: String,
pgb: PostgresBackend,
md5_salt: [u8; 4],
psql_session_id: String,
pgb: PostgresBackend,
}
pub fn proxy_conn_main(
state: &'static ProxyState,
socket: std::net::TcpStream,
) -> anyhow::Result<()> {
let mut conn = ProxyConnection {
pub fn proxy_conn_main(state: &'static ProxyState, socket: TcpStream) -> anyhow::Result<()> {
let conn = ProxyConnection {
state,
cplane: CPlaneApi::new(&state.conf.cplane_address),
user: "".into(),
database: "".into(),
psql_session_id: hex::encode(rand::random::<[u8; 8]>()),
pgb: PostgresBackend::new(
socket,
postgres_backend::AuthType::MD5,
state.conf.ssl_config.clone(),
false,
)?,
md5_salt: [0u8; 4],
psql_session_id: "".into(),
};
// Check StartupMessage
// This will set conn.existing_user and we can decide on next actions
conn.handle_startup()?;
let (client, server) = conn.handle_client()?;
// both scenarious here should end up producing database connection string
let db_info = if conn.is_existing_user() {
conn.handle_existing_user()?
} else {
conn.handle_new_user()?
let server = zenith_utils::sock_split::BidiStream::from_tcp(server);
let client = match client {
Stream::Bidirectional(bidi_stream) => bidi_stream,
_ => panic!("invalid stream type"),
};
proxy_pass(conn.pgb, db_info)
proxy(client.split(), server.split())
}
impl ProxyConnection {
fn is_existing_user(&self) -> bool {
self.user.ends_with("@zenith")
fn handle_client(mut self) -> anyhow::Result<(Stream, TcpStream)> {
let mut authenticate = || {
let (username, dbname) = self.handle_startup()?;
// Both scenarios here should end up producing database credentials
if username.ends_with("@zenith") {
self.handle_existing_user(&username, &dbname)
} else {
self.handle_new_user()
}
};
let conn = match authenticate() {
Ok(db_info) => connect_to_db(db_info),
Err(e) => {
// Report the error to the client
self.pgb.write_message(&Be::ErrorResponse(e.to_string()))?;
bail!("failed to handle client: {:?}", e);
}
};
// We'll get rid of this once migration to async is complete
let (pg_version, db_stream) = {
let runtime = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()?;
let (pg_version, stream) = runtime.block_on(conn)?;
let stream = stream.into_std()?;
stream.set_nonblocking(false)?;
(pg_version, stream)
};
// Let the client send new requests
self.pgb
.write_message_noflush(&BeMessage::ParameterStatus(
BeParameterStatusMessage::ServerVersion(&pg_version),
))?
.write_message(&Be::ReadyForQuery)?;
Ok((self.pgb.into_stream(), db_stream))
}
fn handle_startup(&mut self) -> anyhow::Result<()> {
fn handle_startup(&mut self) -> anyhow::Result<(String, String)> {
let have_tls = self.pgb.tls_config.is_some();
let mut encrypted = false;
loop {
let msg = self.pgb.read_message()?;
println!("got message {:?}", msg);
match msg {
Some(FeMessage::StartupMessage(m)) => {
println!("got startup message {:?}", m);
let mut msg = match self.pgb.read_message()? {
Some(Fe::StartupMessage(msg)) => msg,
None => bail!("connection is lost"),
bad => bail!("unexpected message type: {:?}", bad),
};
println!("got message: {:?}", msg);
match m.kind {
StartupRequestCode::NegotiateGss => {
self.pgb
.write_message(&BeMessage::EncryptionResponse(false))?;
}
StartupRequestCode::NegotiateSsl => {
println!("SSL requested");
if self.pgb.tls_config.is_some() {
self.pgb
.write_message(&BeMessage::EncryptionResponse(true))?;
self.pgb.start_tls()?;
encrypted = true;
} else {
self.pgb
.write_message(&BeMessage::EncryptionResponse(false))?;
}
}
StartupRequestCode::Normal => {
if self.state.conf.ssl_config.is_some() && !encrypted {
self.pgb.write_message(&BeMessage::ErrorResponse(
"must connect with TLS".to_string(),
))?;
bail!("client did not connect with TLS");
}
self.user = m
.params
.get("user")
.ok_or_else(|| {
anyhow::Error::msg("user is required in startup packet")
})?
.into();
self.database = m
.params
.get("database")
.ok_or_else(|| {
anyhow::Error::msg("database is required in startup packet")
})?
.into();
break;
}
StartupRequestCode::Cancel => break,
match msg.kind {
StartupRequestCode::NegotiateGss => {
self.pgb.write_message(&Be::EncryptionResponse(false))?;
}
StartupRequestCode::NegotiateSsl => {
self.pgb.write_message(&Be::EncryptionResponse(have_tls))?;
if have_tls {
self.pgb.start_tls()?;
encrypted = true;
}
}
None => {
bail!("connection closed")
}
unexpected => {
bail!("unexpected message type : {:?}", unexpected)
StartupRequestCode::Normal => {
if have_tls && !encrypted {
bail!("must connect with TLS");
}
let mut get_param = |key| {
msg.params
.remove(key)
.ok_or_else(|| anyhow!("{} is missing in startup packet", key))
};
return Ok((get_param("user")?, get_param("database")?));
}
// TODO: implement proper stmt cancellation
StartupRequestCode::Cancel => bail!("query cancellation is not supported"),
}
}
Ok(())
}
fn handle_existing_user(&mut self) -> anyhow::Result<DatabaseInfo> {
// ask password
rand::thread_rng().fill(&mut self.md5_salt);
fn handle_existing_user(&mut self, user: &str, db: &str) -> anyhow::Result<DatabaseInfo> {
let md5_salt = rand::random::<[u8; 4]>();
// Ask password
self.pgb
.write_message(&BeMessage::AuthenticationMD5Password(&self.md5_salt))?;
.write_message(&Be::AuthenticationMD5Password(&md5_salt))?;
self.pgb.state = ProtoState::Authentication; // XXX
// check password
println!("handle_existing_user");
let msg = self.pgb.read_message()?;
println!("got message {:?}", msg);
if let Some(FeMessage::PasswordMessage(m)) = msg {
println!("got password message '{:?}'", m);
// Check password
let msg = match self.pgb.read_message()? {
Some(Fe::PasswordMessage(msg)) => msg,
None => bail!("connection is lost"),
bad => bail!("unexpected message type: {:?}", bad),
};
println!("got message: {:?}", msg);
assert!(self.is_existing_user());
let (_trailing_null, md5_response) = msg
.split_last()
.ok_or_else(|| anyhow!("unexpected password message"))?;
let (_trailing_null, md5_response) = m
.split_last()
.ok_or_else(|| anyhow::Error::msg("unexpected password message"))?;
let cplane = CPlaneApi::new(&self.state.conf.auth_endpoint, &self.state.waiters);
let db_info = cplane.authenticate_proxy_request(
user,
db,
md5_response,
&md5_salt,
&self.psql_session_id,
)?;
if let Err(e) = self.check_auth_md5(md5_response) {
self.pgb
.write_message(&BeMessage::ErrorResponse(format!("{}", e)))?;
bail!("auth failed: {}", e);
} else {
self.pgb
.write_message_noflush(&BeMessage::AuthenticationOk)?;
self.pgb
.write_message_noflush(&BeMessage::ParameterStatus)?;
self.pgb.write_message(&BeMessage::ReadyForQuery)?;
}
}
self.pgb
.write_message_noflush(&Be::AuthenticationOk)?
.write_message_noflush(&BeParameterStatusMessage::encoding())?;
// ok, we are authorized
self.cplane.get_database_uri(&self.user, &self.database)
Ok(db_info)
}
fn handle_new_user(&mut self) -> anyhow::Result<DatabaseInfo> {
let mut psql_session_id_buf = [0u8; 8];
rand::thread_rng().fill(&mut psql_session_id_buf);
self.psql_session_id = hex::encode(psql_session_id_buf);
let greeting = hello_message(&self.state.conf.redirect_uri, &self.psql_session_id);
let hello_message = format!("☀️ Welcome to Zenith!
To proceed with database creation, open the following link:
{redirect_uri}{sess_id}
It needs to be done once and we will send you '.pgpass' file, which will allow you to access or create
databases without opening the browser.
", redirect_uri = self.state.conf.redirect_uri, sess_id = self.psql_session_id);
// First, register this session
let waiter = self.state.waiters.register(self.psql_session_id.clone());
// Give user a URL to spawn a new database
self.pgb
.write_message_noflush(&BeMessage::AuthenticationOk)?;
self.pgb
.write_message_noflush(&BeMessage::ParameterStatus)?;
self.pgb
.write_message(&BeMessage::NoticeResponse(hello_message))?;
// await for database creation
let (tx, rx) = channel::<anyhow::Result<DatabaseInfo>>();
let _ = self
.state
.waiters
.lock()
.unwrap()
.insert(self.psql_session_id.clone(), tx);
.write_message_noflush(&Be::AuthenticationOk)?
.write_message_noflush(&BeParameterStatusMessage::encoding())?
.write_message(&Be::NoticeResponse(greeting))?;
// Wait for web console response
// XXX: respond with error to client
let dbinfo = rx.recv()??;
let db_info = waiter.wait()?.map_err(|e| anyhow!(e))?;
self.pgb.write_message_noflush(&BeMessage::NoticeResponse(
"Connecting to database.".to_string(),
))?;
self.pgb.write_message(&BeMessage::ReadyForQuery)?;
self.pgb
.write_message_noflush(&Be::NoticeResponse("Connecting to database.".into()))?;
Ok(dbinfo)
}
fn check_auth_md5(&self, md5_response: &[u8]) -> anyhow::Result<()> {
assert!(self.is_existing_user());
self.cplane
.check_auth(self.user.as_str(), md5_response, &self.md5_salt)
Ok(db_info)
}
}
fn hello_message(redirect_uri: &str, session_id: &str) -> String {
format!(
concat![
"☀️ Welcome to Zenith!\n",
"To proceed with database creation, open the following link:\n\n",
" {redirect_uri}{session_id}\n\n",
"It needs to be done once and we will send you '.pgpass' file,\n",
"which will allow you to access or create ",
"databases without opening your web browser."
],
redirect_uri = redirect_uri,
session_id = session_id,
)
}
/// Create a TCP connection to a postgres database, authenticate with it, and receive the ReadyForQuery message
async fn connect_to_db(db_info: DatabaseInfo) -> anyhow::Result<tokio::net::TcpStream> {
let mut socket = tokio::net::TcpStream::connect(db_info.socket_addr()).await?;
let config = db_info.conn_string().parse::<tokio_postgres::Config>()?;
let _ = config.connect_raw(&mut socket, NoTls).await?;
Ok(socket)
async fn connect_to_db(db_info: DatabaseInfo) -> anyhow::Result<(String, tokio::net::TcpStream)> {
let mut socket = tokio::net::TcpStream::connect(db_info.socket_addr()?).await?;
let config = tokio_postgres::Config::from(db_info);
let (client, conn) = config.connect_raw(&mut socket, NoTls).await?;
let query = client.query_one("select current_setting('server_version')", &[]);
tokio::pin!(query, conn);
let version = tokio::select!(
x = query => x?.try_get(0)?,
_ = conn => bail!("connection closed too early"),
);
Ok((version, socket))
}
/// Concurrently proxy both directions of the client and server connections
fn proxy(
client_read: ReadStream,
client_write: WriteStream,
server_read: ReadStream,
server_write: WriteStream,
(client_read, client_write): (ReadStream, WriteStream),
(server_read, server_write): (ReadStream, WriteStream),
) -> anyhow::Result<()> {
fn do_proxy(mut reader: ReadStream, mut writer: WriteStream) -> io::Result<()> {
std::io::copy(&mut reader, &mut writer)?;
writer.flush()?;
writer.shutdown(std::net::Shutdown::Both)
fn do_proxy(mut reader: impl io::Read, mut writer: WriteStream) -> io::Result<u64> {
/// FlushWriter will make sure that every message is sent as soon as possible
struct FlushWriter<W>(W);
impl<W: io::Write> io::Write for FlushWriter<W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
// `std::io::copy` is guaranteed to exit if we return an error,
// so we can afford to lose `res` in case `flush` fails
let res = self.0.write(buf);
if res.is_ok() {
self.flush()?;
}
res
}
fn flush(&mut self) -> io::Result<()> {
self.0.flush()
}
}
let res = std::io::copy(&mut reader, &mut FlushWriter(&mut writer));
writer.shutdown(std::net::Shutdown::Both)?;
res
}
let client_to_server_jh = thread::spawn(move || do_proxy(client_read, server_write));
let res1 = do_proxy(server_read, client_write);
let res2 = client_to_server_jh.join().unwrap();
res1?;
res2?;
do_proxy(server_read, client_write)?;
client_to_server_jh.join().unwrap()?;
Ok(())
}
/// Proxy a client connection to a postgres database
fn proxy_pass(pgb: PostgresBackend, db_info: DatabaseInfo) -> anyhow::Result<()> {
let runtime = tokio::runtime::Builder::new_current_thread().build()?;
let db_stream = runtime.block_on(connect_to_db(db_info))?;
let db_stream = db_stream.into_std()?;
db_stream.set_nonblocking(false)?;
let db_stream = zenith_utils::sock_split::BidiStream::from_tcp(db_stream);
let (db_read, db_write) = db_stream.split();
let stream = match pgb.into_stream() {
Stream::Bidirectional(bidi_stream) => bidi_stream,
_ => bail!("invalid stream"),
};
let (client_read, client_write) = stream.split();
proxy(client_read, client_write, db_read, db_write)
}

proxy/src/state.rs Normal file

@@ -0,0 +1,62 @@
use crate::cplane_api::DatabaseInfo;
use anyhow::{anyhow, ensure, Context};
use rustls::{internal::pemfile, NoClientAuth, ProtocolVersion, ServerConfig};
use std::net::SocketAddr;
use std::sync::Arc;
pub type SslConfig = Arc<ServerConfig>;
pub struct ProxyConfig {
/// main entrypoint for users to connect to
pub proxy_address: SocketAddr,
/// http management endpoint. Upon user account creation control plane
/// will notify us here, so that we can 'unfreeze' user session.
pub mgmt_address: SocketAddr,
/// send unauthenticated users to this URI
pub redirect_uri: String,
/// control plane address where we would check auth.
pub auth_endpoint: String,
pub ssl_config: Option<SslConfig>,
}
pub type ProxyWaiters = crate::waiters::Waiters<Result<DatabaseInfo, String>>;
pub struct ProxyState {
pub conf: ProxyConfig,
pub waiters: ProxyWaiters,
}
impl ProxyState {
pub fn new(conf: ProxyConfig) -> Self {
Self {
conf,
waiters: ProxyWaiters::default(),
}
}
}
pub fn configure_ssl(key_path: &str, cert_path: &str) -> anyhow::Result<SslConfig> {
let key = {
let key_bytes = std::fs::read(key_path).context("SSL key file")?;
let mut keys = pemfile::pkcs8_private_keys(&mut &key_bytes[..])
.map_err(|_| anyhow!("couldn't read TLS keys"))?;
ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len());
keys.pop().unwrap()
};
let cert_chain = {
let cert_chain_bytes = std::fs::read(cert_path).context("SSL cert file")?;
pemfile::certs(&mut &cert_chain_bytes[..])
.map_err(|_| anyhow!("couldn't read TLS certificates"))?
};
let mut config = ServerConfig::new(NoClientAuth::new());
config.set_single_cert(cert_chain, key)?;
config.versions = vec![ProtocolVersion::TLSv1_3];
Ok(config.into())
}

proxy/src/waiters.rs Normal file

@@ -0,0 +1,58 @@
use anyhow::{anyhow, Context};
use std::collections::HashMap;
use std::sync::{mpsc, Mutex};
pub struct Waiters<T>(pub(self) Mutex<HashMap<String, mpsc::Sender<T>>>);
impl<T> Default for Waiters<T> {
fn default() -> Self {
Waiters(Default::default())
}
}
impl<T> Waiters<T> {
pub fn register(&self, key: String) -> Waiter<T> {
let (tx, rx) = mpsc::channel();
// TODO: use `try_insert` (unstable)
let prev = self.0.lock().unwrap().insert(key.clone(), tx);
assert!(matches!(prev, None)); // assert_matches! is nightly-only
Waiter {
receiver: rx,
registry: self,
key,
}
}
pub fn notify(&self, key: &str, value: T) -> anyhow::Result<()>
where
T: Send + Sync + 'static,
{
let tx = self
.0
.lock()
.unwrap()
.remove(key)
.ok_or_else(|| anyhow!("key {} not found", key))?;
tx.send(value).context("channel hangup")
}
}
pub struct Waiter<'a, T> {
receiver: mpsc::Receiver<T>,
registry: &'a Waiters<T>,
key: String,
}
impl<T> Waiter<'_, T> {
pub fn wait(self) -> anyhow::Result<T> {
self.receiver.recv().context("channel hangup")
}
}
impl<T> Drop for Waiter<'_, T> {
fn drop(&mut self) {
self.registry.0.lock().unwrap().remove(&self.key);
}
}
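Since this module is new, here is a hedged usage sketch (the session id and payload are invented) of the intended round trip between the proxy thread that waits and the mgmt thread that notifies:
#[cfg(test)]
mod example {
    use super::*;
    use std::{sync::Arc, thread};

    #[test]
    fn register_notify_wait_round_trip() {
        let waiters: Arc<Waiters<String>> = Arc::new(Waiters::default());

        // Proxy side: register the session *before* anyone can notify it.
        let waiter = waiters.register("4f10dde522e14739".to_owned());

        // Mgmt side: deliver the result from another thread.
        let notifier = {
            let waiters = Arc::clone(&waiters);
            thread::spawn(move || waiters.notify("4f10dde522e14739", "db is ready".to_owned()))
        };

        // The proxy side blocks here until the mgmt thread calls notify().
        assert_eq!(waiter.wait().unwrap(), "db is ready");
        notifier.join().unwrap().unwrap();
    }
}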

pytest.ini Normal file

@@ -0,0 +1,9 @@
[pytest]
addopts =
-m 'not remote_cluster'
markers =
remote_cluster
minversion = 6.0
log_format = %(asctime)s.%(msecs)-3d %(levelname)s [%(filename)s:%(lineno)d] %(message)s
log_date_format = %Y-%m-%d %H:%M:%S
log_cli = true


@@ -8,4 +8,8 @@
# warnings and errors right in the editor.
# In vscode, this setting is Rust-analyzer>Check On Save:Command
cargo clippy "${@:2}" -- -A clippy::new_without_default -A clippy::manual_range_contains -A clippy::comparison_chain
# * `-A unknown_lints` do not warn about unknown lint suppressions
# that people with newer toolchains might use
# * `-D warnings` - fail on any warnings (`cargo` returns non-zero exit status)
cargo clippy "${@:2}" --all-targets --all-features --all --tests -- -A unknown_lints -D warnings

scripts/coverage Executable file

@@ -0,0 +1,510 @@
#!/usr/bin/env python3
# Here's a good link in case you're interested in learning more
# about current deficiencies of rust code coverage story:
# https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+instrument-coverage+label%3AA-code-coverage
#
# Also a couple of inspirational tools which I deliberately ended up not using:
# * https://github.com/mozilla/grcov
# * https://github.com/taiki-e/cargo-llvm-cov
# * https://github.com/llvm/llvm-project/tree/main/llvm/test/tools/llvm-cov
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory
from textwrap import dedent
from typing import Any, Iterable, List, Optional
import argparse
import json
import os
import shutil
import subprocess
import sys
def intersperse(sep: Any, iterable: Iterable[Any]):
fst = True
for item in iterable:
if not fst:
yield sep
fst = False
yield item
def find_demangler(demangler=None):
known_tools = ['c++filt', 'rustfilt', 'llvm-cxxfilt']
if demangler:
# Explicit argument has precedence over `known_tools`
demanglers = [demangler]
else:
demanglers = known_tools
for demangler in demanglers:
if shutil.which(demangler):
return demangler
raise Exception(' '.join([
'Failed to find symbol demangler.',
'Please install it or provide another tool',
f"(e.g. {', '.join(known_tools)})",
]))
class Cargo:
def __init__(self, cwd: Path):
self.cwd = cwd
self.target_dir = Path(os.environ.get('CARGO_TARGET_DIR', cwd / 'target')).resolve()
self._rustlib_dir = None
@property
def rustlib_dir(self):
if not self._rustlib_dir:
cmd = [
'cargo',
'-Zunstable-options',
'rustc',
'--print=target-libdir',
]
self._rustlib_dir = Path(subprocess.check_output(cmd, cwd=self.cwd, text=True)).parent
return self._rustlib_dir
def binaries(self, profile: str) -> List[str]:
executables = []
# This will emit json messages containing test binaries names
cmd = [
'cargo',
'test',
'--no-run',
'--message-format=json',
]
env = dict(os.environ, PROFILE=profile)
output = subprocess.check_output(cmd, cwd=self.cwd, env=env, text=True)
for line in output.splitlines(keepends=False):
meta = json.loads(line)
exe = meta.get('executable')
if exe:
executables.append(exe)
# Metadata contains crate names, which can be used
# to recover names of executables, e.g. `pageserver`
cmd = [
'cargo',
'metadata',
'--format-version=1',
'--no-deps',
]
meta = json.loads(subprocess.check_output(cmd, cwd=self.cwd))
for pkg in meta.get('packages', []):
for target in pkg.get('targets', []):
if 'bin' in target['kind']:
exe = self.target_dir / profile / target['name']
if exe.exists():
executables.append(str(exe))
return executables
@dataclass
class LLVM:
cargo: Cargo
def resolve_tool(self, name: str) -> str:
exe = self.cargo.rustlib_dir / 'bin' / name
if exe.exists():
return str(exe)
if not shutil.which(name):
# Show a user-friendly warning
raise Exception(' '.join([
f"It appears that you don't have `{name}` installed.",
"Please execute `rustup component add llvm-tools-preview`,",
"or install it via your package manager of choice.",
"LLVM tools should be the same version as LLVM in `rustc --version --verbose`.",
]))
return name
def profdata(self, input_dir: Path, output_profdata: Path):
profraws = [f for f in input_dir.iterdir() if f.suffix == '.profraw']
if not profraws:
raise Exception(f'No profraw files found at {input_dir}')
with open(input_dir / 'profraw.list', 'w') as input_files:
profraw_mtime = 0
for profraw in profraws:
profraw_mtime = max(profraw_mtime, profraw.stat().st_mtime_ns)
print(profraw, file=input_files)
input_files.flush()
try:
profdata_mtime = output_profdata.stat().st_mtime_ns
except FileNotFoundError:
profdata_mtime = 0
# An obvious make-ish optimization
if profraw_mtime >= profdata_mtime:
subprocess.check_call([
self.resolve_tool('llvm-profdata'),
'merge',
'-sparse',
f'-input-files={input_files.name}',
f'-output={output_profdata}',
])
def _cov(self,
*extras,
subcommand: str,
profdata: Path,
objects: List[str],
sources: List[str],
demangler: Optional[str] = None) -> None:
cwd = self.cargo.cwd
objects = list(intersperse('-object', objects))
extras = list(extras)
# For some reason `rustc` produces relative paths to src files,
# so we force it to cut the $PWD prefix.
# see: https://github.com/rust-lang/rust/issues/34701#issuecomment-739809584
if sources:
extras.append(f'-path-equivalence=.,{cwd.resolve()}')
if demangler:
extras.append(f'-Xdemangler={demangler}')
cmd = [
self.resolve_tool('llvm-cov'),
subcommand, # '-dump-collected-paths', # classified debug flag
'-instr-profile',
str(profdata),
*extras,
*objects,
*sources,
]
subprocess.check_call(cmd, cwd=cwd)
def cov_report(self, **kwargs) -> None:
self._cov(subcommand='report', **kwargs)
def cov_export(self, *, kind: str, **kwargs) -> None:
extras = [f'-format={kind}']
self._cov(subcommand='export', *extras, **kwargs)
def cov_show(self, *, kind: str, output_dir: Optional[Path] = None, **kwargs) -> None:
extras = [f'-format={kind}']
if output_dir:
extras.append(f'-output-dir={output_dir}')
self._cov(subcommand='show', *extras, **kwargs)
@dataclass
class Report(ABC):
""" Common properties of a coverage report """
llvm: LLVM
demangler: str
profdata: Path
objects: List[str]
sources: List[str]
def _common_kwargs(self):
return dict(profdata=self.profdata,
objects=self.objects,
sources=self.sources,
demangler=self.demangler)
@abstractmethod
def generate(self):
pass
def open(self):
# Do nothing by default
pass
class SummaryReport(Report):
def generate(self):
self.llvm.cov_report(**self._common_kwargs())
class TextReport(Report):
def generate(self):
self.llvm.cov_show(kind='text', **self._common_kwargs())
class LcovReport(Report):
def generate(self):
self.llvm.cov_export(kind='lcov', **self._common_kwargs())
@dataclass
class HtmlReport(Report):
output_dir: Path
def generate(self):
self.llvm.cov_show(kind='html', output_dir=self.output_dir, **self._common_kwargs())
print(f'HTML report is located at `{self.output_dir}`')
def open(self):
tool = dict(linux='xdg-open', darwin='open').get(sys.platform)
if not tool:
raise Exception(f'Unknown platform {sys.platform}')
subprocess.check_call([tool, self.output_dir / 'index.html'],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL)
@dataclass
class GithubPagesReport(HtmlReport):
output_dir: Path
commit_url: str
def generate(self):
def index_path(path):
return path / 'index.html'
common = self._common_kwargs()
# Provide default sources if there's none
common.setdefault('sources', ['.'])
self.llvm.cov_show(kind='html', output_dir=self.output_dir, **common)
shutil.copy(index_path(self.output_dir), self.output_dir / 'local.html')
with TemporaryDirectory() as tmp:
output_dir = Path(tmp)
args = dict(common, sources=[])
self.llvm.cov_show(kind='html', output_dir=output_dir, **args)
shutil.copy(index_path(output_dir), self.output_dir / 'all.html')
with open(index_path(self.output_dir), 'w') as index:
commit_sha = self.commit_url.rsplit('/', maxsplit=1)[-1][:10]
html = f"""
<!DOCTYPE html>
<html>
<head>
<title>Coverage ({commit_sha})</title>
</head>
<body>
<h1>
Coverage report for commit
<a href="{self.commit_url}">
{commit_sha}
</a>
</h1>
<p>
<a href="./local.html">
<b>Show only local sources</b>
</a>
</p>
<p>
<a href="./all.html">
Show all sources (including dependencies)
</a>
</p>
</body>
</html>
"""
index.write(dedent(html))
print(f'HTML report is located at `{self.output_dir}`')
class State:
def __init__(self, cwd: Path, top_dir: Optional[Path], profraw_prefix: Optional[str]):
# Use hostname by default
profraw_prefix = profraw_prefix or '%h'
self.cwd = cwd
self.cargo = Cargo(self.cwd)
self.llvm = LLVM(self.cargo)
self.top_dir = top_dir or self.cargo.target_dir / 'coverage'
self.report_dir = self.top_dir / 'report'
# Directory for raw coverage data emitted by executables
self.profraw_dir = self.top_dir / 'profraw'
self.profraw_dir.mkdir(parents=True, exist_ok=True)
# Aggregated coverage data
self.profdata_file = self.top_dir / 'coverage.profdata'
# Dump all coverage data files into a dedicated directory.
# Each filename is parameterized by PID & executable's signature.
os.environ['LLVM_PROFILE_FILE'] = str(self.profraw_dir /
f'cov-{profraw_prefix}-%p-%m.profraw')
os.environ['RUSTFLAGS'] = ' '.join([
os.environ.get('RUSTFLAGS', ''),
# Enable LLVM's source-based coverage
# see: https://clang.llvm.org/docs/SourceBasedCodeCoverage.html
# see: https://blog.rust-lang.org/inside-rust/2020/11/12/source-based-code-coverage.html
'-Zinstrument-coverage',
# Link every bit of code to prevent "holes" in coverage report
# see: https://doc.rust-lang.org/rustc/codegen-options/index.html#link-dead-code
'-Clink-dead-code',
# Some of the paths that `rustc` embeds into binaries are absolute, others are relative.
# The point is, we can't have both, because depending on `-path-equivalence`, `llvm-cov`
# either will cripple absolute paths or won't be able to show relative paths at all.
# There's no way to turn relative paths into absolute, so we strip $PWD prefix.
# Only source files of deps (e.g. `$HOME/.cargo`) will keep their absolute paths,
# but we won't include them in report by default (but see `--all`).
f'--remap-path-prefix {self.cwd}=',
])
# XXX: God, have mercy on our souls...
# see: https://github.com/rust-lang/rust/pull/90132
os.environ['RUSTC_BOOTSTRAP'] = '1'
def do_run(self, args):
subprocess.check_call([*args.command, *args.args])
def do_report(self, args):
if args.all and args.sources:
raise Exception('--all should not be used with sources')
# see man for `llvm-cov show [sources]`
if args.all:
sources = []
elif not args.sources:
sources = ['.']
else:
sources = args.sources
print('* Merging profraw files')
self.llvm.profdata(self.profraw_dir, self.profdata_file)
objects = []
if args.input_objects:
print('* Collecting object files using --input-objects')
with open(args.input_objects) as f:
objects.extend(f.read().splitlines(keepends=False))
if args.cargo_objects == 'true' or (args.cargo_objects == 'auto'
and not args.input_objects):
print('* Collecting object files using cargo')
objects.extend(self.cargo.binaries(args.profile))
params = dict(llvm=self.llvm,
demangler=find_demangler(args.demangler),
profdata=self.profdata_file,
objects=objects,
sources=sources)
formats = {
'html':
lambda: HtmlReport(**params, output_dir=self.report_dir),
'text':
lambda: TextReport(**params),
'lcov':
lambda: LcovReport(**params),
'summary':
lambda: SummaryReport(**params),
'github':
lambda: GithubPagesReport(
**params, output_dir=self.report_dir, commit_url=args.commit_url),
}
report_factory = formats.get(args.format)
if not report_factory:
raise Exception(f'Format `{args.format}` is not supported')
report = report_factory()
print(f'* Rendering coverage report ({args.format})')
report.generate()
if args.open:
print('* Opening the report')
report.open()
def do_clean(self, args):
# Wipe everything if no filters have been provided
if not (args.report or args.prof):
shutil.rmtree(self.top_dir, ignore_errors=True)
else:
if args.report:
shutil.rmtree(self.report_dir, ignore_errors=True)
if args.prof:
self.profdata_file.unlink(missing_ok=True)
def main():
app = sys.argv[0]
example = f"""
prerequisites:
# alternatively, install a system package for `llvm-tools`
rustup component add llvm-tools-preview
self-contained example:
{app} run make
{app} run pipenv run pytest test_runner
{app} run cargo test
{app} report --open
"""
parser = argparse.ArgumentParser(description='Coverage report builder',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=example)
parser.add_argument('--dir', type=Path, help='output directory')
parser.add_argument('--profraw-prefix', metavar='STRING', type=str)
commands = parser.add_subparsers(title='commands', dest='subparser_name')
p_run = commands.add_parser('run', help='run a command with magic env')
p_run.add_argument('command', nargs=1)
p_run.add_argument('args', nargs=argparse.REMAINDER)
p_report = commands.add_parser('report', help='generate a coverage report')
p_report.add_argument('--profile',
default='debug',
choices=('debug', 'release'),
help='cargo build profile')
p_report.add_argument('--format',
default='html',
choices=('html', 'text', 'summary', 'lcov', 'github'),
help='report format')
p_report.add_argument('--input-objects',
metavar='FILE',
type=Path,
help='file containing list of binaries')
p_report.add_argument('--cargo-objects',
default='auto',
choices=('auto', 'true', 'false'),
help='use cargo for auto discovery of binaries')
p_report.add_argument('--commit-url', type=str, help='required for --format=github')
p_report.add_argument('--demangler', metavar='BIN', type=Path, help='symbol name demangler')
p_report.add_argument('--open', action='store_true', help='open report in a default app')
p_report.add_argument('--all', action='store_true', help='show everything, e.g. deps')
p_report.add_argument('sources', nargs='*', type=Path, help='source file or directory')
p_clean = commands.add_parser('clean', help='wipe coverage artifacts')
p_clean.add_argument('--report', action='store_true', help='pick generated report')
p_clean.add_argument('--prof', action='store_true', help='pick *.profdata & *.profraw')
args = parser.parse_args()
state = State(cwd=Path.cwd(), top_dir=args.dir, profraw_prefix=args.profraw_prefix)
commands = {
'run': state.do_run,
'report': state.do_report,
'clean': state.do_clean,
}
action = commands.get(args.subparser_name)
if action:
action(args)
else:
parser.print_help()
if __name__ == '__main__':
main()


@@ -0,0 +1,27 @@
#!/bin/bash
# this is a shortcut script to avoid duplication in CI
set -eux -o pipefail
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
git clone https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-perf-data.git
cd zenith-perf-data
mkdir -p reports/
mkdir -p data/$REPORT_TO
cp $REPORT_FROM/* data/$REPORT_TO
echo "Generating report"
pipenv run python $SCRIPT_DIR/generate_perf_report_page.py --input-dir data/$REPORT_TO --out reports/$REPORT_TO.html
echo "Uploading perf result"
git add data reports
git \
-c "user.name=vipvap" \
-c "user.email=vipvap@zenith.tech" \
commit \
--author="vipvap <vipvap@zenith.tech>" \
-m "add performance test result for $GITHUB_SHA zenith revision"
git push https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-perf-data.git master


@@ -0,0 +1,207 @@
#!/usr/bin/env python3
import argparse
from dataclasses import dataclass
from pathlib import Path
import json
from typing import Any, Dict, List, Optional, Tuple, cast
from jinja2 import Template
# skip 'input' columns. They are included in the header and would just bloat the table
EXCLUDE_COLUMNS = frozenset({
'scale',
'duration',
'number_of_clients',
'number_of_threads',
'init_start_timestamp',
'init_end_timestamp',
'run_start_timestamp',
'run_end_timestamp',
})
KEY_EXCLUDE_FIELDS = frozenset({
'init_start_timestamp',
'init_end_timestamp',
'run_start_timestamp',
'run_end_timestamp',
})
NEGATIVE_COLOR = 'negative'
POSITIVE_COLOR = 'positive'
@dataclass
class SuitRun:
revision: str
values: Dict[str, Any]
@dataclass
class SuitRuns:
platform: str
suit: str
common_columns: List[Tuple[str, str]]
value_columns: List[str]
runs: List[SuitRun]
@dataclass
class RowValue:
value: str
color: str
ratio: str
def get_columns(values: List[Dict[Any, Any]]) -> Tuple[List[Tuple[str, str]], List[str]]:
value_columns = []
common_columns = []
for item in values:
if item['name'] in KEY_EXCLUDE_FIELDS:
continue
if item['report'] != 'test_param':
value_columns.append(cast(str, item['name']))
else:
common_columns.append((cast(str, item['name']), cast(str, item['value'])))
value_columns.sort()
common_columns.sort(key=lambda x: x[0]) # sort by name
return common_columns, value_columns
def format_ratio(ratio: float, report: str) -> Tuple[str, str]:
color = ''
sign = '+' if ratio > 0 else ''
if abs(ratio) < 0.05:
return f'&nbsp({sign}{ratio:.2f})', color
if report not in {'test_param', 'higher_is_better', 'lower_is_better'}:
raise ValueError(f'Unknown report type: {report}')
if report == 'test_param':
return f'{ratio:.2f}', color
if ratio > 0:
if report == 'higher_is_better':
color = POSITIVE_COLOR
elif report == 'lower_is_better':
color = NEGATIVE_COLOR
elif ratio < 0:
if report == 'higher_is_better':
color = NEGATIVE_COLOR
elif report == 'lower_is_better':
color = POSITIVE_COLOR
return f'&nbsp({sign}{ratio:.2f})', color
def extract_value(name: str, suit_run: SuitRun) -> Optional[Dict[str, Any]]:
for item in suit_run.values['data']:
if item['name'] == name:
return cast(Dict[str, Any], item)
return None
def get_row_values(columns: List[str], run_result: SuitRun,
prev_result: Optional[SuitRun]) -> List[RowValue]:
row_values = []
for column in columns:
current_value = extract_value(column, run_result)
if current_value is None:
# should never happen
raise ValueError(f'{column} not found in {run_result.values}')
value = current_value["value"]
if isinstance(value, float):
value = f'{value:.2f}'
if prev_result is None:
row_values.append(RowValue(value, '', ''))
continue
prev_value = extract_value(column, prev_result)
if prev_value is None:
# This might happen when a new metric is added and there is no value for it in the previous run.
# Leave this here for now; TODO: add proper handling when this actually happens.
raise ValueError(f'{column} not found in previous result')
ratio = float(value) / float(prev_value['value']) - 1
ratio_display, color = format_ratio(ratio, current_value['report'])
row_values.append(RowValue(value, color, ratio_display))
return row_values
@dataclass
class SuiteRunTableRow:
revision: str
values: List[RowValue]
def prepare_rows_from_runs(value_columns: List[str], runs: List[SuitRun]) -> List[SuiteRunTableRow]:
rows = []
prev_run = None
for run in runs:
rows.append(
SuiteRunTableRow(revision=run.revision,
values=get_row_values(value_columns, run, prev_run)))
prev_run = run
return rows
def main(args: argparse.Namespace) -> None:
input_dir = Path(args.input_dir)
grouped_runs: Dict[str, SuitRuns] = {}
# We have files of the form <ctr>_<rev>.json.
# Fill them into the hashmap so that runs of the same configuration
# (scale, duration, etc.) are grouped together, ordered by counter.
for item in sorted(input_dir.iterdir(), key=lambda x: int(x.name.split('_')[0])):
run_data = json.loads(item.read_text())
revision = run_data['revision']
for suit_result in run_data['result']:
key = "{}{}".format(run_data['platform'], suit_result['suit'])
# pack total duration as a synthetic value
total_duration = suit_result['total_duration']
suit_result['data'].append({
'name': 'total_duration',
'value': total_duration,
'unit': 's',
'report': 'lower_is_better',
})
common_columns, value_columns = get_columns(suit_result['data'])
grouped_runs.setdefault(
key,
SuitRuns(
platform=run_data['platform'],
suit=suit_result['suit'],
common_columns=common_columns,
value_columns=value_columns,
runs=[],
),
)
grouped_runs[key].runs.append(SuitRun(revision=revision, values=suit_result))
context = {}
for result in grouped_runs.values():
suit = result.suit
context[suit] = {
'common_columns': result.common_columns,
'value_columns': result.value_columns,
'platform': result.platform,
# reverse the order so newest results are on top of the table
'rows': reversed(prepare_rows_from_runs(result.value_columns, result.runs)),
}
template = Template((Path(__file__).parent / 'perf_report_template.html').read_text())
Path(args.out).write_text(template.render(context=context))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--input-dir',
dest='input_dir',
required=True,
help='Directory with jsons generated by the test suite',
)
parser.add_argument('--out', required=True, help='Output html file path')
args = parser.parse_args()
main(args)

scripts/git-upload Executable file

@@ -0,0 +1,136 @@
#!/usr/bin/env python3
from contextlib import contextmanager
from tempfile import TemporaryDirectory
from pathlib import Path
import argparse
import os
import shutil
import subprocess
import sys
def absolute_path(path):
return Path(path).resolve()
def relative_path(path):
path = Path(path)
if path.is_absolute():
raise Exception(f'path `{path}` must be relative!')
return path
@contextmanager
def chdir(cwd: Path):
old = os.getcwd()
os.chdir(cwd)
try:
yield cwd
finally:
os.chdir(old)
def run(cmd, *args, **kwargs):
print('$', ' '.join(cmd))
subprocess.check_call(cmd, *args, **kwargs)
class GitRepo:
def __init__(self, url):
self.url = url
self.cwd = TemporaryDirectory()
subprocess.check_call([
'git',
'clone',
str(url),
self.cwd.name,
])
def is_dirty(self):
res = subprocess.check_output(['git', 'status', '--porcelain'], text=True).strip()
return bool(res)
def update(self, message, action, branch=None):
with chdir(self.cwd.name):
if not branch:
cmd = ['git', 'branch', '--show-current']
branch = subprocess.check_output(cmd, text=True).strip()
# Run action in repo's directory
action()
run(['git', 'add', '.'])
if not self.is_dirty():
print('No changes detected, quitting')
return
run([
'git',
'-c',
'user.name=vipvap',
'-c',
'user.email=vipvap@zenith.tech',
'commit',
'--author="vipvap <vipvap@zenith.tech>"',
f'--message={message}',
])
for _ in range(5):
try:
run(['git', 'fetch', 'origin', branch])
run(['git', 'rebase', f'origin/{branch}'])
run(['git', 'push', 'origin', branch])
return
except subprocess.CalledProcessError as e:
print(f'failed to update branch `{branch}`: {e}', file=sys.stderr)
raise Exception(f'failed to update branch `{branch}`')
def do_copy(args):
src = args.src
dst = args.dst
try:
if src.is_dir():
shutil.copytree(src, dst)
else:
shutil.copy(src, dst)
except FileExistsError:
if args.forbid_overwrite:
raise
def main():
parser = argparse.ArgumentParser(description='Git upload tool')
parser.add_argument('--repo', type=str, metavar='URL', required=True, help='git repo url')
parser.add_argument('--message', type=str, metavar='TEXT', help='commit message')
commands = parser.add_subparsers(title='commands', dest='subparser_name')
p_copy = commands.add_parser('copy', help='copy file into the repo')
p_copy.add_argument('src', type=absolute_path, help='source path')
p_copy.add_argument('dst', type=relative_path, help='relative dest path')
p_copy.add_argument('--forbid-overwrite', action='store_true', help='do not allow overwrites')
args = parser.parse_args()
commands = {
'copy': do_copy,
}
action = commands.get(args.subparser_name)
if action:
message = args.message or 'update'
GitRepo(args.repo).update(message, lambda: action(args))
else:
parser.print_usage()
if __name__ == '__main__':
main()
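For reference, here is a hypothetical invocation of the tool; the repository URL and file paths are placeholders rather than values taken from this repository, and the destination directory is assumed to already exist in the target repo.

```python
# CLI form (placeholders):
#   scripts/git-upload --repo git@example.com:reports.git --message "update report" \
#       copy /tmp/report.html reports/index.html
#
# Programmatic equivalent using the classes above; update() runs the action
# inside the fresh clone, so the relative destination lands in that repo.
import shutil

repo = GitRepo('git@example.com:reports.git')  # placeholder URL
repo.update('update report',
            lambda: shutil.copy('/tmp/report.html', 'reports/index.html'))
```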


@@ -0,0 +1,52 @@
<!DOCTYPE html>
<html>
<body>
<style>
table,
th,
td {
border: 1px solid black;
border-collapse: collapse;
}
.positive {
background-color: rgba(0, 255, 0, 0.8)
}
.negative {
background-color: rgba(255, 0, 0, 0.65)
}
</style>
<h2>Zenith Performance Tests</h2>
{% for suit_name, suit_data in context.items() %}
<h3>Runs for {{ suit_name }} </h3>
<b>platform:</b> {{ suit_data.platform }}<br>
{% for common_column_name, common_column_value in suit_data.common_columns %}
<b>{{ common_column_name }}</b>: {{ common_column_value }}<br>
{% endfor %}
<br>
<table>
<tr>
<th>revision</th>
{% for column_name in suit_data.value_columns %}
<th>{{ column_name }}</th>
{% endfor %}
</tr>
{% for row in suit_data.rows %}
<tr>
<td><a href=https://github.com/zenithdb/zenith/commit/{{ row.revision }}>{{ row.revision[:6] }}</a></td>
{% for column_value in row.values %}
<td class="{{ column_value.color }}">{{ column_value.value }}{{column_value.ratio}}</td>
{% endfor %}
</tr>
{% endfor %}
</table>
{% endfor %}
</body>
</html>
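The template above expects `context` to map each suite name to its per-suite data. A minimal sketch of that shape (the suite, column, and value names are made up, and the namedtuples merely stand in for whatever `prepare_rows_from_runs` actually returns):

```python
from collections import namedtuple
from pathlib import Path

from jinja2 import Template

Row = namedtuple('Row', ['revision', 'values'])   # hypothetical stand-in types
Val = namedtuple('Val', ['value', 'ratio', 'color'])

context = {
    'pgbench': {                                   # suite name (assumed)
        'platform': 'local',
        'common_columns': [('scale', '10')],
        'value_columns': ['tps'],
        'rows': [Row('deadbeefcafe', [Val('1234.5', '', 'positive')])],
    },
}
# Assumes the template file sits in the current directory.
html = Template(Path('perf_report_template.html').read_text()).render(context=context)
```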


@@ -10,8 +10,11 @@ max-line-length = 100
[yapf]
based_on_style = pep8
column_limit = 100
split_all_top_level_comma_separated_values = true
[mypy]
# mypy uses regex
exclude = ^vendor/
# some tests don't typecheck when this flag is set
check_untyped_defs = false
@@ -21,7 +24,11 @@ disallow_untyped_decorators = false
disallow_untyped_defs = false
strict = true
[mypy-psycopg2.*]
[mypy-asyncpg.*]
# There is some work in progress, though: https://github.com/MagicStack/asyncpg/pull/577
ignore_missing_imports = true
[mypy-cached_property.*]
ignore_missing_imports = true
[mypy-pytest.*]


@@ -1,20 +0,0 @@
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"
[packages]
pytest = ">=6.0.0"
psycopg2 = "*"
typing-extensions = "*"
pyjwt = {extras = ["crypto"], version = "*"}
requests = "*"
[dev-packages]
yapf = "*"
flake8 = "*"
mypy = "*"
[requires]
# we need at least 3.6, but pipenv doesn't allow to say this directly
python_version = "3"

test_runner/Pipfile.lock generated

@@ -1,328 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "b666740289d9c82797e5c39b2a7f0074c865c9183ee878ce4fa5cda7928506ea"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.python.org/simple",
"verify_ssl": true
}
]
},
"default": {
"attrs": {
"hashes": [
"sha256:149e90d6d8ac20db7a955ad60cf0e6881a3f20d37096140088356da6c716b0b1",
"sha256:ef6aaac3ca6cd92904cdd0d83f629a15f18053ec84e6432106f7a4d04ae4f5fb"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
"version": "==21.2.0"
},
"certifi": {
"hashes": [
"sha256:2bbf76fd432960138b3ef6dda3dde0544f27cbf8546c458e60baf371917ba9ee",
"sha256:50b1e4f8446b06f41be7dd6338db18e0990601dce795c2b1686458aa7e8fa7d8"
],
"version": "==2021.5.30"
},
"cffi": {
"hashes": [
"sha256:06c54a68935738d206570b20da5ef2b6b6d92b38ef3ec45c5422c0ebaf338d4d",
"sha256:0c0591bee64e438883b0c92a7bed78f6290d40bf02e54c5bf0978eaf36061771",
"sha256:19ca0dbdeda3b2615421d54bef8985f72af6e0c47082a8d26122adac81a95872",
"sha256:22b9c3c320171c108e903d61a3723b51e37aaa8c81255b5e7ce102775bd01e2c",
"sha256:26bb2549b72708c833f5abe62b756176022a7b9a7f689b571e74c8478ead51dc",
"sha256:33791e8a2dc2953f28b8d8d300dde42dd929ac28f974c4b4c6272cb2955cb762",
"sha256:3c8d896becff2fa653dc4438b54a5a25a971d1f4110b32bd3068db3722c80202",
"sha256:4373612d59c404baeb7cbd788a18b2b2a8331abcc84c3ba40051fcd18b17a4d5",
"sha256:487d63e1454627c8e47dd230025780e91869cfba4c753a74fda196a1f6ad6548",
"sha256:48916e459c54c4a70e52745639f1db524542140433599e13911b2f329834276a",
"sha256:4922cd707b25e623b902c86188aca466d3620892db76c0bdd7b99a3d5e61d35f",
"sha256:55af55e32ae468e9946f741a5d51f9896da6b9bf0bbdd326843fec05c730eb20",
"sha256:57e555a9feb4a8460415f1aac331a2dc833b1115284f7ded7278b54afc5bd218",
"sha256:5d4b68e216fc65e9fe4f524c177b54964af043dde734807586cf5435af84045c",
"sha256:64fda793737bc4037521d4899be780534b9aea552eb673b9833b01f945904c2e",
"sha256:6d6169cb3c6c2ad50db5b868db6491a790300ade1ed5d1da29289d73bbe40b56",
"sha256:7bcac9a2b4fdbed2c16fa5681356d7121ecabf041f18d97ed5b8e0dd38a80224",
"sha256:80b06212075346b5546b0417b9f2bf467fea3bfe7352f781ffc05a8ab24ba14a",
"sha256:818014c754cd3dba7229c0f5884396264d51ffb87ec86e927ef0be140bfdb0d2",
"sha256:8eb687582ed7cd8c4bdbff3df6c0da443eb89c3c72e6e5dcdd9c81729712791a",
"sha256:99f27fefe34c37ba9875f224a8f36e31d744d8083e00f520f133cab79ad5e819",
"sha256:9f3e33c28cd39d1b655ed1ba7247133b6f7fc16fa16887b120c0c670e35ce346",
"sha256:a8661b2ce9694ca01c529bfa204dbb144b275a31685a075ce123f12331be790b",
"sha256:a9da7010cec5a12193d1af9872a00888f396aba3dc79186604a09ea3ee7c029e",
"sha256:aedb15f0a5a5949ecb129a82b72b19df97bbbca024081ed2ef88bd5c0a610534",
"sha256:b315d709717a99f4b27b59b021e6207c64620790ca3e0bde636a6c7f14618abb",
"sha256:ba6f2b3f452e150945d58f4badd92310449876c4c954836cfb1803bdd7b422f0",
"sha256:c33d18eb6e6bc36f09d793c0dc58b0211fccc6ae5149b808da4a62660678b156",
"sha256:c9a875ce9d7fe32887784274dd533c57909b7b1dcadcc128a2ac21331a9765dd",
"sha256:c9e005e9bd57bc987764c32a1bee4364c44fdc11a3cc20a40b93b444984f2b87",
"sha256:d2ad4d668a5c0645d281dcd17aff2be3212bc109b33814bbb15c4939f44181cc",
"sha256:d950695ae4381ecd856bcaf2b1e866720e4ab9a1498cba61c602e56630ca7195",
"sha256:e22dcb48709fc51a7b58a927391b23ab37eb3737a98ac4338e2448bef8559b33",
"sha256:e8c6a99be100371dbb046880e7a282152aa5d6127ae01783e37662ef73850d8f",
"sha256:e9dc245e3ac69c92ee4c167fbdd7428ec1956d4e754223124991ef29eb57a09d",
"sha256:eb687a11f0a7a1839719edd80f41e459cc5366857ecbed383ff376c4e3cc6afd",
"sha256:eb9e2a346c5238a30a746893f23a9535e700f8192a68c07c0258e7ece6ff3728",
"sha256:ed38b924ce794e505647f7c331b22a693bee1538fdf46b0222c4717b42f744e7",
"sha256:f0010c6f9d1a4011e429109fda55a225921e3206e7f62a0c22a35344bfd13cca",
"sha256:f0c5d1acbfca6ebdd6b1e3eded8d261affb6ddcf2186205518f1428b8569bb99",
"sha256:f10afb1004f102c7868ebfe91c28f4a712227fe4cb24974350ace1f90e1febbf",
"sha256:f174135f5609428cc6e1b9090f9268f5c8935fddb1b25ccb8255a2d50de6789e",
"sha256:f3ebe6e73c319340830a9b2825d32eb6d8475c1dac020b4f0aa774ee3b898d1c",
"sha256:f627688813d0a4140153ff532537fbe4afea5a3dffce1f9deb7f91f848a832b5",
"sha256:fd4305f86f53dfd8cd3522269ed7fc34856a8ee3709a5e28b2836b2db9d4cd69"
],
"version": "==1.14.6"
},
"charset-normalizer": {
"hashes": [
"sha256:0c8911edd15d19223366a194a513099a302055a962bca2cec0f54b8b63175d8b",
"sha256:f23667ebe1084be45f6ae0538e4a5a865206544097e4e8bbcacf42cd02a348f3"
],
"markers": "python_version >= '3'",
"version": "==2.0.4"
},
"cryptography": {
"hashes": [
"sha256:0f1212a66329c80d68aeeb39b8a16d54ef57071bf22ff4e521657b27372e327d",
"sha256:1e056c28420c072c5e3cb36e2b23ee55e260cb04eee08f702e0edfec3fb51959",
"sha256:240f5c21aef0b73f40bb9f78d2caff73186700bf1bc6b94285699aff98cc16c6",
"sha256:26965837447f9c82f1855e0bc8bc4fb910240b6e0d16a664bb722df3b5b06873",
"sha256:37340614f8a5d2fb9aeea67fd159bfe4f5f4ed535b1090ce8ec428b2f15a11f2",
"sha256:3d10de8116d25649631977cb37da6cbdd2d6fa0e0281d014a5b7d337255ca713",
"sha256:3d8427734c781ea5f1b41d6589c293089704d4759e34597dce91014ac125aad1",
"sha256:7ec5d3b029f5fa2b179325908b9cd93db28ab7b85bb6c1db56b10e0b54235177",
"sha256:8e56e16617872b0957d1c9742a3f94b43533447fd78321514abbe7db216aa250",
"sha256:b01fd6f2737816cb1e08ed4807ae194404790eac7ad030b34f2ce72b332f5586",
"sha256:bf40af59ca2465b24e54f671b2de2c59257ddc4f7e5706dbd6930e26823668d3",
"sha256:de4e5f7f68220d92b7637fc99847475b59154b7a1b3868fb7385337af54ac9ca",
"sha256:eb8cc2afe8b05acbd84a43905832ec78e7b3873fb124ca190f574dca7389a87d",
"sha256:ee77aa129f481be46f8d92a1a7db57269a2f23052d5f2433b4621bb457081cc9"
],
"version": "==3.4.7"
},
"idna": {
"hashes": [
"sha256:14475042e284991034cb48e06f6851428fb14c4dc953acd9be9a5e95c7b6dd7a",
"sha256:467fbad99067910785144ce333826c71fb0e63a425657295239737f7ecd125f3"
],
"markers": "python_version >= '3'",
"version": "==3.2"
},
"iniconfig": {
"hashes": [
"sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3",
"sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"
],
"version": "==1.1.1"
},
"packaging": {
"hashes": [
"sha256:7dc96269f53a4ccec5c0670940a4281106dd0bb343f47b7471f779df49c2fbe7",
"sha256:c86254f9220d55e31cc94d69bade760f0847da8000def4dfe1c6b872fd14ff14"
],
"markers": "python_version >= '3.6'",
"version": "==21.0"
},
"pluggy": {
"hashes": [
"sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0",
"sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.13.1"
},
"psycopg2": {
"hashes": [
"sha256:079d97fc22de90da1d370c90583659a9f9a6ee4007355f5825e5f1c70dffc1fa",
"sha256:2087013c159a73e09713294a44d0c8008204d06326006b7f652bef5ace66eebb",
"sha256:2c992196719fadda59f72d44603ee1a2fdcc67de097eea38d41c7ad9ad246e62",
"sha256:7640e1e4d72444ef012e275e7b53204d7fab341fb22bc76057ede22fe6860b25",
"sha256:7f91312f065df517187134cce8e395ab37f5b601a42446bdc0f0d51773621854",
"sha256:830c8e8dddab6b6716a4bf73a09910c7954a92f40cf1d1e702fb93c8a919cc56",
"sha256:89409d369f4882c47f7ea20c42c5046879ce22c1e4ea20ef3b00a4dfc0a7f188",
"sha256:bf35a25f1aaa8a3781195595577fcbb59934856ee46b4f252f56ad12b8043bcf",
"sha256:de5303a6f1d0a7a34b9d40e4d3bef684ccc44a49bbe3eb85e3c0bffb4a131b7c"
],
"index": "pypi",
"version": "==2.9.1"
},
"py": {
"hashes": [
"sha256:21b81bda15b66ef5e1a777a21c4dcd9c20ad3efd0b3f817e7a809035269e1bd3",
"sha256:3b80836aa6d1feeaa108e046da6423ab8f6ceda6468545ae8d02d9d58d18818a"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==1.10.0"
},
"pycparser": {
"hashes": [
"sha256:2d475327684562c3a96cc71adf7dc8c4f0565175cf86b6d7a404ff4c771f15f0",
"sha256:7582ad22678f0fcd81102833f60ef8d0e57288b6b5fb00323d101be910e35705"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.20"
},
"pyjwt": {
"extras": [
"crypto"
],
"hashes": [
"sha256:934d73fbba91b0483d3857d1aff50e96b2a892384ee2c17417ed3203f173fca1",
"sha256:fba44e7898bbca160a2b2b501f492824fc8382485d3a6f11ba5d0c1937ce6130"
],
"index": "pypi",
"version": "==2.1.0"
},
"pyparsing": {
"hashes": [
"sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1",
"sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.4.7"
},
"pytest": {
"hashes": [
"sha256:50bcad0a0b9c5a72c8e4e7c9855a3ad496ca6a881a3641b4260605450772c54b",
"sha256:91ef2131a9bd6be8f76f1f08eac5c5317221d6ad1e143ae03894b862e8976890"
],
"index": "pypi",
"version": "==6.2.4"
},
"requests": {
"hashes": [
"sha256:6c1246513ecd5ecd4528a0906f910e8f0f9c6b8ec72030dc9fd154dc1a6efd24",
"sha256:b8aa58f8cf793ffd8782d3d8cb19e66ef36f7aba4353eec859e74678b01b07a7"
],
"index": "pypi",
"version": "==2.26.0"
},
"toml": {
"hashes": [
"sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b",
"sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.10.2"
},
"typing-extensions": {
"hashes": [
"sha256:0ac0f89795dd19de6b97debb0c6af1c70987fd80a2d62d1958f7e56fcc31b497",
"sha256:50b6f157849174217d0656f99dc82fe932884fb250826c18350e159ec6cdf342",
"sha256:779383f6086d90c99ae41cf0ff39aac8a7937a9283ce0a414e5dd782f4c94a84"
],
"index": "pypi",
"version": "==3.10.0.0"
},
"urllib3": {
"hashes": [
"sha256:39fb8672126159acb139a7718dd10806104dec1e2f0f6c88aab05d17df10c8d4",
"sha256:f57b4c16c62fa2760b7e3d97c35b255512fb6b59a259730f36ba32ce9f8e342f"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4' and python_version < '4'",
"version": "==1.26.6"
}
},
"develop": {
"flake8": {
"hashes": [
"sha256:07528381786f2a6237b061f6e96610a4167b226cb926e2aa2b6b1d78057c576b",
"sha256:bf8fd333346d844f616e8d47905ef3a3384edae6b4e9beb0c5101e25e3110907"
],
"index": "pypi",
"version": "==3.9.2"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"mypy": {
"hashes": [
"sha256:088cd9c7904b4ad80bec811053272986611b84221835e079be5bcad029e79dd9",
"sha256:0aadfb2d3935988ec3815952e44058a3100499f5be5b28c34ac9d79f002a4a9a",
"sha256:119bed3832d961f3a880787bf621634ba042cb8dc850a7429f643508eeac97b9",
"sha256:1a85e280d4d217150ce8cb1a6dddffd14e753a4e0c3cf90baabb32cefa41b59e",
"sha256:3c4b8ca36877fc75339253721f69603a9c7fdb5d4d5a95a1a1b899d8b86a4de2",
"sha256:3e382b29f8e0ccf19a2df2b29a167591245df90c0b5a2542249873b5c1d78212",
"sha256:42c266ced41b65ed40a282c575705325fa7991af370036d3f134518336636f5b",
"sha256:53fd2eb27a8ee2892614370896956af2ff61254c275aaee4c230ae771cadd885",
"sha256:704098302473cb31a218f1775a873b376b30b4c18229421e9e9dc8916fd16150",
"sha256:7df1ead20c81371ccd6091fa3e2878559b5c4d4caadaf1a484cf88d93ca06703",
"sha256:866c41f28cee548475f146aa4d39a51cf3b6a84246969f3759cb3e9c742fc072",
"sha256:a155d80ea6cee511a3694b108c4494a39f42de11ee4e61e72bc424c490e46457",
"sha256:adaeee09bfde366d2c13fe6093a7df5df83c9a2ba98638c7d76b010694db760e",
"sha256:b6fb13123aeef4a3abbcfd7e71773ff3ff1526a7d3dc538f3929a49b42be03f0",
"sha256:b94e4b785e304a04ea0828759172a15add27088520dc7e49ceade7834275bedb",
"sha256:c0df2d30ed496a08de5daed2a9ea807d07c21ae0ab23acf541ab88c24b26ab97",
"sha256:c6c2602dffb74867498f86e6129fd52a2770c48b7cd3ece77ada4fa38f94eba8",
"sha256:ceb6e0a6e27fb364fb3853389607cf7eb3a126ad335790fa1e14ed02fba50811",
"sha256:d9dd839eb0dc1bbe866a288ba3c1afc33a202015d2ad83b31e875b5905a079b6",
"sha256:e4dab234478e3bd3ce83bac4193b2ecd9cf94e720ddd95ce69840273bf44f6de",
"sha256:ec4e0cd079db280b6bdabdc807047ff3e199f334050db5cbb91ba3e959a67504",
"sha256:ecd2c3fe726758037234c93df7e98deb257fd15c24c9180dacf1ef829da5f921",
"sha256:ef565033fa5a958e62796867b1df10c40263ea9ded87164d67572834e57a174d"
],
"index": "pypi",
"version": "==0.910"
},
"mypy-extensions": {
"hashes": [
"sha256:090fedd75945a69ae91ce1303b5824f428daf5a028d2f6ab8a299250a846f15d",
"sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"
],
"version": "==0.4.3"
},
"pycodestyle": {
"hashes": [
"sha256:514f76d918fcc0b55c6680472f0a37970994e07bbb80725808c17089be302068",
"sha256:c389c1d06bf7904078ca03399a4816f974a1d590090fecea0c63ec26ebaf1cef"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.7.0"
},
"pyflakes": {
"hashes": [
"sha256:7893783d01b8a89811dd72d7dfd4d84ff098e5eed95cfa8905b22bbffe52efc3",
"sha256:f5bc8ecabc05bb9d291eb5203d6810b49040f6ff446a756326104746cc00c1db"
],
"markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==2.3.1"
},
"toml": {
"hashes": [
"sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b",
"sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"
],
"markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
"version": "==0.10.2"
},
"typing-extensions": {
"hashes": [
"sha256:0ac0f89795dd19de6b97debb0c6af1c70987fd80a2d62d1958f7e56fcc31b497",
"sha256:50b6f157849174217d0656f99dc82fe932884fb250826c18350e159ec6cdf342",
"sha256:779383f6086d90c99ae41cf0ff39aac8a7937a9283ce0a414e5dd782f4c94a84"
],
"index": "pypi",
"version": "==3.10.0.0"
},
"yapf": {
"hashes": [
"sha256:408fb9a2b254c302f49db83c59f9aa0b4b0fd0ec25be3a5c51181327922ff63d",
"sha256:e3a234ba8455fe201eaa649cdac872d590089a18b661e39bbac7020978dd9c2e"
],
"index": "pypi",
"version": "==0.31.0"
}
}
}


@@ -3,18 +3,13 @@
This directory contains integration tests.
Prerequisites:
- Python 3.6 or later
- Dependencies: install them via `pipenv install`. Note that Debian/Ubuntu
packages are stale, as commonly happens, so manual installation is not
recommended.
Run `pipenv shell` to activate the venv or use `pipenv run` to run a single
command in the venv, e.g. `pipenv run pytest`.
- Correctly configured Python, see [`/docs/sourcetree.md`](/docs/sourcetree.md#using-python)
- Zenith and Postgres binaries
- See the root README.md for build directions
- See the root [README.md](/README.md) for build directions
- Tests can be run from the git tree; or see the environment variables
below to run from other directories.
- The zenith git repo, including the postgres submodule
(for some tests, e.g. pg_regress)
(for some tests, e.g. `pg_regress`)
### Test Organization
@@ -35,15 +30,15 @@ be stored under a directory `test_output`.
You can run all the tests with:
`pytest`
`pipenv run pytest`
If you want to run all the tests in a particular file:
`pytest test_pgbench.py`
`pipenv run pytest test_pgbench.py`
If you want to run all tests that have the string "bench" in their names:
`pytest -k bench`
`pipenv run pytest -k bench`
Useful environment variables:
@@ -53,8 +48,8 @@ Useful environment variables:
should go.
`TEST_SHARED_FIXTURES`: Try to re-use a single pageserver for all the tests.
Let stdout and stderr go to the terminal instead of capturing them:
`pytest -s ...`
Let stdout, stderr and `INFO` log messages go to the terminal instead of capturing them:
`pytest -s --log-cli-level=INFO ...`
(Note many tests capture subprocess outputs separately, so this may not
show much.)
@@ -62,44 +57,51 @@ Exit after the first test failure:
`pytest -x ...`
(there are many more pytest options; run `pytest -h` to see them.)
### Writing a test
### Building new tests
Every test needs a Zenith Environment, or ZenithEnv to operate in. A Zenith Environment
is like a little cloud-in-a-box, and consists of a Pageserver, 0-N Safekeepers, and
compute Postgres nodes. The connections between them can be configured to use JWT
authentication tokens, and some other configuration options can be tweaked too.
The tests make heavy use of pytest fixtures. You can read about how they work here: https://docs.pytest.org/en/stable/fixture.html
The easiest way to get access to a Zenith Environment is by using the `zenith_simple_env`
fixture. The 'simple' env may be shared across multiple tests, so don't shut down the nodes
or make other destructive changes in that environment. Also don't assume that
there are no tenants or branches or data in the cluster. For convenience, there is a
branch called `empty`, though. The convention is to create a test-specific branch of
that and load any test data there, instead of the 'main' branch.
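A minimal test following that convention might look like this (the branch name, query, and asserted value are purely illustrative):

```python
from fixtures.zenith_fixtures import ZenithEnv

pytest_plugins = ("fixtures.zenith_fixtures")


def test_example(zenith_simple_env: ZenithEnv):
    env = zenith_simple_env
    # Branch off 'empty' so the shared environment is left untouched
    env.zenith_cli(["branch", "test_example", "empty"])
    pg = env.postgres.create_start('test_example')
    cur = pg.connect().cursor()
    cur.execute('SELECT 1')
    assert cur.fetchone() == (1, )
```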
Essentially, this means that each time you see a fixture name as an input parameter, the fixture function with that name is run and its result is passed to the test function.
So this code:
For more complicated cases, you can build a custom Zenith Environment with the `zenith_env_builder`
fixture:
```python
def test_something(zenith_cli, pg_bin):
pass
def test_foobar(zenith_env_builder: ZenithEnvBuilder):
# Prescribe the environment.
# We want to have 3 safekeeper nodes, and use JWT authentication in the
# connections to the page server
zenith_env_builder.num_safekeepers = 3
zenith_env_builder.set_pageserver_auth(True)
# Now create the environment. This initializes the repository, and starts
# up the page server and the safekeepers
env = zenith_env_builder.init()
# Run the test
...
```
... will run the fixtures called `zenith_cli` and `pg_bin` and deliver those results to the test function.
For more information about pytest fixtures, see https://docs.pytest.org/en/stable/fixture.html
Fixtures can't be imported using the normal python syntax. Instead, use this:
At the end of a test, all the nodes in the environment are automatically stopped, so you
don't need to worry about cleaning up. Logs and test data are preserved for analysis
in a directory under `../test_output/<testname>`.
```python
pytest_plugins = ("fixtures.something")
```
### Before submitting a patch
Ensure that you pass all [obligatory checks](/docs/sourcetree.md#obligatory-checks).
That will make all the fixtures in the `fixtures/something.py` file available.
Anything that's likely to be used in multiple tests should be built into a fixture.
Note that fixtures can clean up after themselves if they use the `yield` syntax.
Cleanup will happen even if the test fails (raises an unhandled exception).
Python destructors, e.g. `__del__()`, are not recommended for cleanup.
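As a sketch, a `yield`-style fixture with cleanup could look like this (the fixture name and the file it manages are hypothetical; `tmp_path` is a standard pytest fixture):

```python
import pytest


@pytest.fixture
def scratch_file(tmp_path):
    # Setup: runs before the test body.
    path = tmp_path / "scratch.txt"
    path.write_text("hello")
    yield path
    # Teardown: runs after the test, even if the test raised an exception.
    if path.exists():
        path.unlink()
```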
### Code quality
Before submitting a patch, please consider:
Also consider:
* Writing a couple of docstrings to clarify the reasoning behind a new test.
* Running `flake8` (or a linter of your choice, e.g. `pycodestyle`) and fixing possible defects, if any.
* Formatting the code with `yapf -r -i .` (TODO: implement an opt-in pre-commit hook for that).
* (Optional) Typechecking the code with `mypy .`. Currently this mostly affects `fixtures/zenith_fixtures.py`.
The tools can be installed with `pipenv install --dev`.
* Adding more type hints to your code to avoid `Any`, especially:
* For fixture parameters, since their types are not automatically deduced.
* For function arguments and return values.


@@ -1,17 +1,22 @@
from contextlib import closing
from typing import Iterator
from uuid import uuid4
import psycopg2
from fixtures.zenith_fixtures import Postgres, ZenithCli, ZenithPageserver, PgBin
from fixtures.zenith_fixtures import ZenithEnvBuilder
import pytest
pytest_plugins = ("fixtures.zenith_fixtures")
def test_pageserver_auth(pageserver_auth_enabled: ZenithPageserver):
ps = pageserver_auth_enabled
tenant_token = ps.auth_keys.generate_tenant_token(ps.initial_tenant)
invalid_tenant_token = ps.auth_keys.generate_tenant_token(uuid4().hex)
management_token = ps.auth_keys.generate_management_token()
def test_pageserver_auth(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.pageserver_auth_enabled = True
env = zenith_env_builder.init()
ps = env.pageserver
tenant_token = env.auth_keys.generate_tenant_token(env.initial_tenant)
invalid_tenant_token = env.auth_keys.generate_tenant_token(uuid4().hex)
management_token = env.auth_keys.generate_management_token()
# this does not invoke auth check and only decodes jwt and checks it for validity
# check both tokens
@@ -19,56 +24,41 @@ def test_pageserver_auth(pageserver_auth_enabled: ZenithPageserver):
ps.safe_psql("status", password=management_token)
# tenant can create branches
ps.safe_psql(f"branch_create {ps.initial_tenant} new1 main", password=tenant_token)
ps.safe_psql(f"branch_create {env.initial_tenant} new1 main", password=tenant_token)
# console can create branches for tenant
ps.safe_psql(f"branch_create {ps.initial_tenant} new2 main", password=management_token)
ps.safe_psql(f"branch_create {env.initial_tenant} new2 main", password=management_token)
# fail to create branch using token with different tenantid
with pytest.raises(psycopg2.DatabaseError, match='Tenant id mismatch. Permission denied'):
ps.safe_psql(f"branch_create {ps.initial_tenant} new2 main", password=invalid_tenant_token)
ps.safe_psql(f"branch_create {env.initial_tenant} new2 main", password=invalid_tenant_token)
# create tenant using management token
ps.safe_psql(f"tenant_create {uuid4().hex}", password=management_token)
# fail to create tenant using tenant token
with pytest.raises(psycopg2.DatabaseError, match='Attempt to access management api with tenant scope. Permission denied'):
with pytest.raises(
psycopg2.DatabaseError,
match='Attempt to access management api with tenant scope. Permission denied'):
ps.safe_psql(f"tenant_create {uuid4().hex}", password=tenant_token)
@pytest.mark.parametrize('with_wal_acceptors', [False, True])
def test_compute_auth_to_pageserver(
zenith_cli: ZenithCli,
wa_factory,
pageserver_auth_enabled: ZenithPageserver,
repo_dir: str,
with_wal_acceptors: bool,
pg_bin: PgBin
):
ps = pageserver_auth_enabled
# since we are in progress of refactoring protocols between compute safekeeper and page server
# use hardcoded management token in safekeeper
management_token = ps.auth_keys.generate_management_token()
def test_compute_auth_to_pageserver(zenith_env_builder: ZenithEnvBuilder, with_wal_acceptors: bool):
zenith_env_builder.pageserver_auth_enabled = True
if with_wal_acceptors:
zenith_env_builder.num_safekeepers = 3
env = zenith_env_builder.init()
branch = f"test_compute_auth_to_pageserver{with_wal_acceptors}"
zenith_cli.run(["branch", branch, "empty"])
if with_wal_acceptors:
wa_factory.start_n_new(3, management_token)
env.zenith_cli(["branch", branch, "main"])
with Postgres(
zenith_cli=zenith_cli,
repo_dir=repo_dir,
pg_bin=pg_bin,
tenant_id=ps.initial_tenant,
port=55432, # FIXME port distribution is hardcoded in tests and in cli
).create_start(
branch,
wal_acceptors=wa_factory.get_connstrs() if with_wal_acceptors else None,
) as pg:
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
# we rely upon autocommit after each statement
# as waiting for acceptors happens there
cur.execute('CREATE TABLE t(key int primary key, value text)')
cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'")
cur.execute('SELECT sum(key) FROM t')
assert cur.fetchone() == (5000050000, )
pg = env.postgres.create_start(branch)
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
# we rely upon autocommit after each statement
# as waiting for acceptors happens there
cur.execute('CREATE TABLE t(key int primary key, value text)')
cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'")
cur.execute('SELECT sum(key) FROM t')
assert cur.fetchone() == (5000050000, )


@@ -1,6 +1,11 @@
import subprocess
from fixtures.zenith_fixtures import PostgresFactory, ZenithPageserver
from contextlib import closing
import psycopg2.extras
import pytest
from fixtures.log_helper import log
from fixtures.utils import print_gc_result
from fixtures.zenith_fixtures import ZenithEnv
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -8,18 +13,27 @@ pytest_plugins = ("fixtures.zenith_fixtures")
#
# Create a couple of branches off the main branch, at a historical point in time.
#
def test_branch_behind(zenith_cli, pageserver: ZenithPageserver, postgres: PostgresFactory, pg_bin):
def test_branch_behind(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
# Branch at the point where only 100 rows were inserted
zenith_cli.run(["branch", "test_branch_behind", "empty"])
env.zenith_cli(["branch", "test_branch_behind", "empty"])
pgmain = postgres.create_start('test_branch_behind')
print("postgres is running on 'test_branch_behind' branch")
pgmain = env.postgres.create_start('test_branch_behind')
log.info("postgres is running on 'test_branch_behind' branch")
main_pg_conn = pgmain.connect()
main_cur = main_pg_conn.cursor()
main_cur.execute("SHOW zenith.zenith_timeline")
timeline = main_cur.fetchone()[0]
# Create table, and insert the first 100 rows
main_cur.execute('CREATE TABLE foo (t text)')
# keep some early lsn to test branch creation at an out-of-date lsn
main_cur.execute('SELECT pg_current_wal_insert_lsn()')
gced_lsn = main_cur.fetchone()[0]
main_cur.execute('''
INSERT INTO foo
SELECT 'long string to consume some space' || g
@@ -27,38 +41,38 @@ def test_branch_behind(zenith_cli, pageserver: ZenithPageserver, postgres: Postg
''')
main_cur.execute('SELECT pg_current_wal_insert_lsn()')
lsn_a = main_cur.fetchone()[0]
print('LSN after 100 rows: ' + lsn_a)
log.info(f'LSN after 100 rows: {lsn_a}')
# Insert some more rows. (This generates enough WAL to fill a few segments.)
main_cur.execute('''
INSERT INTO foo
SELECT 'long string to consume some space' || g
FROM generate_series(1, 100000) g
FROM generate_series(1, 200000) g
''')
main_cur.execute('SELECT pg_current_wal_insert_lsn()')
lsn_b = main_cur.fetchone()[0]
print('LSN after 100100 rows: ' + lsn_b)
log.info(f'LSN after 200100 rows: {lsn_b}')
# Branch at the point where only 100 rows were inserted
zenith_cli.run(["branch", "test_branch_behind_hundred", "test_branch_behind@" + lsn_a])
env.zenith_cli(["branch", "test_branch_behind_hundred", "test_branch_behind@" + lsn_a])
# Insert many more rows. This generates enough WAL to fill a few segments.
main_cur.execute('''
INSERT INTO foo
SELECT 'long string to consume some space' || g
FROM generate_series(1, 100000) g
FROM generate_series(1, 200000) g
''')
main_cur.execute('SELECT pg_current_wal_insert_lsn()')
main_cur.execute('SELECT pg_current_wal_insert_lsn()')
lsn_c = main_cur.fetchone()[0]
print('LSN after 200100 rows: ' + lsn_c)
log.info(f'LSN after 400100 rows: {lsn_c}')
# Branch at the point where only 200 rows were inserted
zenith_cli.run(["branch", "test_branch_behind_more", "test_branch_behind@" + lsn_b])
# Branch at the point where only 200100 rows were inserted
env.zenith_cli(["branch", "test_branch_behind_more", "test_branch_behind@" + lsn_b])
pg_hundred = postgres.create_start("test_branch_behind_hundred")
pg_more = postgres.create_start("test_branch_behind_more")
pg_hundred = env.postgres.create_start("test_branch_behind_hundred")
pg_more = env.postgres.create_start("test_branch_behind_more")
# On the 'hundred' branch, we should see only 100 rows
hundred_pg_conn = pg_hundred.connect()
@@ -70,23 +84,43 @@ def test_branch_behind(zenith_cli, pageserver: ZenithPageserver, postgres: Postg
more_pg_conn = pg_more.connect()
more_cur = more_pg_conn.cursor()
more_cur.execute('SELECT count(*) FROM foo')
assert more_cur.fetchone() == (100100, )
assert more_cur.fetchone() == (200100, )
# All the rows are visible on the main branch
main_cur.execute('SELECT count(*) FROM foo')
assert main_cur.fetchone() == (200100, )
assert main_cur.fetchone() == (400100, )
# Check bad lsn's for branching
# branch at segment boundary
zenith_cli.run(["branch", "test_branch_segment_boundary", "test_branch_behind@0/3000000"])
pg = postgres.create_start("test_branch_segment_boundary")
env.zenith_cli(["branch", "test_branch_segment_boundary", "test_branch_behind@0/3000000"])
pg = env.postgres.create_start("test_branch_segment_boundary")
cur = pg.connect().cursor()
cur.execute('SELECT 1')
assert cur.fetchone() == (1, )
# branch at pre-initdb lsn
try:
zenith_cli.run(["branch", "test_branch_preinitdb", "test_branch_behind@0/42"])
except subprocess.CalledProcessError:
print("Branch creation with pre-initdb LSN failed (as expected)")
with pytest.raises(Exception, match="invalid branch start lsn"):
env.zenith_cli(["branch", "test_branch_preinitdb", "test_branch_behind@0/42"])
# check that we cannot create branch based on garbage collected data
with closing(env.pageserver.connect()) as psconn:
with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur:
# call gc to advance latest_gc_cutoff_lsn
pscur.execute(f"do_gc {env.initial_tenant} {timeline} 0")
row = pscur.fetchone()
print_gc_result(row)
with pytest.raises(Exception, match="invalid branch start lsn"):
# this gced_lsn is pretty random, so if gc is disabled this wouldn't fail
env.zenith_cli(["branch", "test_branch_create_fail", f"test_branch_behind@{gced_lsn}"])
# check that after gc everything is still there
hundred_cur.execute('SELECT count(*) FROM foo')
assert hundred_cur.fetchone() == (100, )
more_cur.execute('SELECT count(*) FROM foo')
assert more_cur.fetchone() == (200100, )
main_cur.execute('SELECT count(*) FROM foo')
assert main_cur.fetchone() == (400100, )


@@ -3,7 +3,8 @@ import os
from contextlib import closing
from fixtures.zenith_fixtures import PostgresFactory, ZenithPageserver, check_restored_datadir_content
from fixtures.zenith_fixtures import ZenithEnv
from fixtures.log_helper import log
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -11,20 +12,24 @@ pytest_plugins = ("fixtures.zenith_fixtures")
#
# Test compute node start after clog truncation
#
def test_clog_truncate(zenith_cli, pageserver: ZenithPageserver, postgres: PostgresFactory, pg_bin):
def test_clog_truncate(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
# Create a branch for us
zenith_cli.run(["branch", "test_clog_truncate", "empty"])
env.zenith_cli(["branch", "test_clog_truncate", "empty"])
# set aggressive autovacuum to make sure that truncation will happen
config = [
'autovacuum_max_workers=10', 'autovacuum_vacuum_threshold=0',
'autovacuum_vacuum_insert_threshold=0', 'autovacuum_vacuum_cost_delay=0',
'autovacuum_vacuum_cost_limit=10000', 'autovacuum_naptime =1s',
'autovacuum_max_workers=10',
'autovacuum_vacuum_threshold=0',
'autovacuum_vacuum_insert_threshold=0',
'autovacuum_vacuum_cost_delay=0',
'autovacuum_vacuum_cost_limit=10000',
'autovacuum_naptime =1s',
'autovacuum_freeze_max_age=100000'
]
pg = postgres.create_start('test_clog_truncate', config_lines=config)
print('postgres is running on test_clog_truncate branch')
pg = env.postgres.create_start('test_clog_truncate', config_lines=config)
log.info('postgres is running on test_clog_truncate branch')
# Install extension containing function needed for test
pg.safe_psql('CREATE EXTENSION zenith_test_utils')
@@ -33,22 +38,22 @@ def test_clog_truncate(zenith_cli, pageserver: ZenithPageserver, postgres: Postg
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
cur.execute('select test_consume_xids(1000*1000*10);')
print('xids consumed')
log.info('xids consumed')
# call a checkpoint to trigger TruncateSubtrans
cur.execute('CHECKPOINT;')
# ensure WAL flush
cur.execute('select txid_current()')
print(cur.fetchone())
log.info(cur.fetchone())
# wait for autovacuum to truncate the pg_xact
# XXX Is it worth adding a timeout here?
pg_xact_0000_path = os.path.join(pg.pg_xact_dir_path(), '0000')
print("pg_xact_0000_path = " + pg_xact_0000_path)
log.info(f"pg_xact_0000_path = {pg_xact_0000_path}")
while os.path.isfile(pg_xact_0000_path):
print("file exists. wait for truncation. " "pg_xact_0000_path = " + pg_xact_0000_path)
log.info(f"file exists. wait for truncation. " "pg_xact_0000_path = {pg_xact_0000_path}")
time.sleep(5)
# checkpoint to advance latest lsn
@@ -59,14 +64,14 @@ def test_clog_truncate(zenith_cli, pageserver: ZenithPageserver, postgres: Postg
lsn_after_truncation = cur.fetchone()[0]
# create new branch after clog truncation and start a compute node on it
print('create branch at lsn_after_truncation ' + lsn_after_truncation)
zenith_cli.run(
log.info(f'create branch at lsn_after_truncation {lsn_after_truncation}')
env.zenith_cli(
["branch", "test_clog_truncate_new", "test_clog_truncate@" + lsn_after_truncation])
pg2 = postgres.create_start('test_clog_truncate_new')
print('postgres is running on test_clog_truncate_new branch')
pg2 = env.postgres.create_start('test_clog_truncate_new')
log.info('postgres is running on test_clog_truncate_new branch')
# check that new node doesn't contain truncated segment
pg_xact_0000_path_new = os.path.join(pg2.pg_xact_dir_path(), '0000')
print("pg_xact_0000_path_new = " + pg_xact_0000_path_new)
log.info(f"pg_xact_0000_path_new = {pg_xact_0000_path_new}")
assert os.path.isfile(pg_xact_0000_path_new) is False


@@ -1,6 +1,7 @@
from contextlib import closing
from fixtures.zenith_fixtures import PostgresFactory, ZenithPageserver
from fixtures.zenith_fixtures import ZenithEnv
from fixtures.log_helper import log
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -8,13 +9,14 @@ pytest_plugins = ("fixtures.zenith_fixtures")
#
# Test starting Postgres with custom options
#
def test_config(zenith_cli, pageserver: ZenithPageserver, postgres: PostgresFactory, pg_bin):
def test_config(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
# Create a branch for us
zenith_cli.run(["branch", "test_config", "empty"])
env.zenith_cli(["branch", "test_config", "empty"])
# change config
pg = postgres.create_start('test_config', config_lines=['log_min_messages=debug1'])
print('postgres is running on test_config branch')
pg = env.postgres.create_start('test_config', config_lines=['log_min_messages=debug1'])
log.info('postgres is running on test_config branch')
with closing(pg.connect()) as conn:
with conn.cursor() as cur:


@@ -2,7 +2,8 @@ import os
import pathlib
from contextlib import closing
from fixtures.zenith_fixtures import ZenithPageserver, PostgresFactory, ZenithCli, check_restored_datadir_content
from fixtures.zenith_fixtures import ZenithEnv, check_restored_datadir_content
from fixtures.log_helper import log
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -10,16 +11,12 @@ pytest_plugins = ("fixtures.zenith_fixtures")
#
# Test CREATE DATABASE when there have been relmapper changes
#
def test_createdb(
zenith_cli: ZenithCli,
pageserver: ZenithPageserver,
postgres: PostgresFactory,
pg_bin,
):
zenith_cli.run(["branch", "test_createdb", "empty"])
def test_createdb(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
env.zenith_cli(["branch", "test_createdb", "empty"])
pg = postgres.create_start('test_createdb')
print("postgres is running on 'test_createdb' branch")
pg = env.postgres.create_start('test_createdb')
log.info("postgres is running on 'test_createdb' branch")
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
@@ -32,27 +29,24 @@ def test_createdb(
lsn = cur.fetchone()[0]
# Create a branch
zenith_cli.run(["branch", "test_createdb2", "test_createdb@" + lsn])
env.zenith_cli(["branch", "test_createdb2", "test_createdb@" + lsn])
pg2 = postgres.create_start('test_createdb2')
pg2 = env.postgres.create_start('test_createdb2')
# Test that you can connect to the new database on both branches
for db in (pg, pg2):
db.connect(dbname='foodb').close()
#
# Test DROP DATABASE
#
def test_dropdb(
zenith_cli: ZenithCli,
pageserver: ZenithPageserver,
postgres: PostgresFactory,
pg_bin,
):
zenith_cli.run(["branch", "test_dropdb", "empty"])
def test_dropdb(zenith_simple_env: ZenithEnv, test_output_dir):
env = zenith_simple_env
env.zenith_cli(["branch", "test_dropdb", "empty"])
pg = postgres.create_start('test_dropdb')
print("postgres is running on 'test_dropdb' branch")
pg = env.postgres.create_start('test_dropdb')
log.info("postgres is running on 'test_dropdb' branch")
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
@@ -64,7 +58,6 @@ def test_dropdb(
cur.execute("SELECT oid FROM pg_database WHERE datname='foodb';")
dboid = cur.fetchone()[0]
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
cur.execute('DROP DATABASE foodb')
@@ -74,28 +67,29 @@ def test_dropdb(
cur.execute('SELECT pg_current_wal_insert_lsn()')
lsn_after_drop = cur.fetchone()[0]
# Create two branches before and after database drop.
zenith_cli.run(["branch", "test_before_dropdb", "test_dropdb@" + lsn_before_drop])
pg_before = postgres.create_start('test_before_dropdb')
env.zenith_cli(["branch", "test_before_dropdb", "test_dropdb@" + lsn_before_drop])
pg_before = env.postgres.create_start('test_before_dropdb')
zenith_cli.run(["branch", "test_after_dropdb", "test_dropdb@" + lsn_after_drop])
pg_after = postgres.create_start('test_after_dropdb')
env.zenith_cli(["branch", "test_after_dropdb", "test_dropdb@" + lsn_after_drop])
pg_after = env.postgres.create_start('test_after_dropdb')
# Test that database exists on the branch before drop
pg_before.connect(dbname='foodb').close()
# Test that database subdir exists on the branch before drop
assert pg_before.pgdata_dir
dbpath = pathlib.Path(pg_before.pgdata_dir) / 'base' / str(dboid)
print(dbpath)
log.info(dbpath)
assert os.path.isdir(dbpath) == True
# Test that database subdir doesn't exist on the branch after drop
assert pg_after.pgdata_dir
dbpath = pathlib.Path(pg_after.pgdata_dir) / 'base' / str(dboid)
print(dbpath)
log.info(dbpath)
assert os.path.isdir(dbpath) == False
# Check that we restore the content of the datadir correctly
check_restored_datadir_content(zenith_cli, pg, lsn_after_drop, postgres)
check_restored_datadir_content(test_output_dir, env, pg)


@@ -1,6 +1,7 @@
from contextlib import closing
from fixtures.zenith_fixtures import PostgresFactory, ZenithPageserver
from fixtures.zenith_fixtures import ZenithEnv
from fixtures.log_helper import log
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -8,11 +9,12 @@ pytest_plugins = ("fixtures.zenith_fixtures")
#
# Test CREATE USER to check shared catalog restore
#
def test_createuser(zenith_cli, pageserver: ZenithPageserver, postgres: PostgresFactory, pg_bin):
zenith_cli.run(["branch", "test_createuser", "empty"])
def test_createuser(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
env.zenith_cli(["branch", "test_createuser", "empty"])
pg = postgres.create_start('test_createuser')
print("postgres is running on 'test_createuser' branch")
pg = env.postgres.create_start('test_createuser')
log.info("postgres is running on 'test_createuser' branch")
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
@@ -25,9 +27,9 @@ def test_createuser(zenith_cli, pageserver: ZenithPageserver, postgres: Postgres
lsn = cur.fetchone()[0]
# Create a branch
zenith_cli.run(["branch", "test_createuser2", "test_createuser@" + lsn])
env.zenith_cli(["branch", "test_createuser2", "test_createuser@" + lsn])
pg2 = postgres.create_start('test_createuser2')
pg2 = env.postgres.create_start('test_createuser2')
# Test that you can connect to new branch as a new user
assert pg2.safe_psql('select current_user', username='testuser') == [('testuser', )]


@@ -1,4 +1,5 @@
from fixtures.zenith_fixtures import PostgresFactory, ZenithPageserver, check_restored_datadir_content
from fixtures.zenith_fixtures import ZenithEnv, check_restored_datadir_content
from fixtures.log_helper import log
pytest_plugins = ("fixtures.zenith_fixtures")
@@ -9,12 +10,13 @@ pytest_plugins = ("fixtures.zenith_fixtures")
# it only checks next_multixact_id field in restored pg_control,
# since we don't have functions to check multixact internals.
#
def test_multixact(pageserver: ZenithPageserver, postgres: PostgresFactory, pg_bin, zenith_cli, base_dir):
def test_multixact(zenith_simple_env: ZenithEnv, test_output_dir):
env = zenith_simple_env
# Create a branch for us
zenith_cli.run(["branch", "test_multixact", "empty"])
pg = postgres.create_start('test_multixact')
env.zenith_cli(["branch", "test_multixact", "empty"])
pg = env.postgres.create_start('test_multixact')
print("postgres is running on 'test_multixact' branch")
log.info("postgres is running on 'test_multixact' branch")
pg_conn = pg.connect()
cur = pg_conn.cursor()
@@ -51,10 +53,10 @@ def test_multixact(pageserver: ZenithPageserver, postgres: PostgresFactory, pg_b
assert int(next_multixact_id) > int(next_multixact_id_old)
# Branch at this point
zenith_cli.run(["branch", "test_multixact_new", "test_multixact@" + lsn])
pg_new = postgres.create_start('test_multixact_new')
env.zenith_cli(["branch", "test_multixact_new", "test_multixact@" + lsn])
pg_new = env.postgres.create_start('test_multixact_new')
print("postgres is running on 'test_multixact_new' branch")
log.info("postgres is running on 'test_multixact_new' branch")
pg_new_conn = pg_new.connect()
cur_new = pg_new_conn.cursor()
@@ -65,4 +67,4 @@ def test_multixact(pageserver: ZenithPageserver, postgres: PostgresFactory, pg_b
assert next_multixact_id_new == next_multixact_id
# Check that we restore the content of the datadir correctly
check_restored_datadir_content(zenith_cli, pg, lsn, postgres)
check_restored_datadir_content(test_output_dir, env, pg_new)

Some files were not shown because too many files have changed in this diff.