Commit Graph

82 Commits

Author SHA1 Message Date
Arthur Petukhovsky
70778058d9 Add test for safekeeper setup without pageserver (#1000) 2021-12-29 12:58:27 +03:00
anastasia
5ef2b1baf7 Add new test illustrating issue with sync-safekeepers.
If safekeepers sync fast enough, callmemaybe thread may never make a call before receiving Unsubscribe request. This leads to the situation, when pageserver lacks data that exists on safekeepers.
2021-12-28 17:50:48 +03:00
anastasia
980f5f8440 Propagate remote_consistent_lsn to safekeepers.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration respectively.

Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
2021-12-24 15:32:54 +03:00
Heikki Linnakangas
1cc181ca32 Fix WAL redo of commit records with subtransactions.
If a commit record contains XIDs that are stored on different CLOG pages,
we duplicate the commit record for each affected CLOG page. In the redo
routine, we must only apply the parts of the record that apply to the
CLOG page being restored. We got that right in the loop that handles the
sub-XIDs, but incorrectly always set the bit that corresponds to the main
XID.
2021-12-21 23:08:01 +02:00
Heikki Linnakangas
927587cec8 Fix comments in tests 2021-12-21 22:38:33 +02:00
Heikki Linnakangas
bcf80eaa95 Fix multixacts members WAL redo.
The logic to compute the page number was broken, and as a result, only
the first page of multixact members was updated correctly. All the
rest were left as zeros. Improve test_multixact.py to generate more
multixacts, to cover this case.

Also fix the check that the restored PG data directory matches the
original one. Previously, the test compared the 'pg_new' cluster,
which is a bit silly because the test restored the 'pg_new' cluster
only a few lines earlier, so if the multixact WAL redo is somehow
broken, the comparison will just compare two broken data directories
and report success. Change it to compare the original datadir, the one
where the multixacts were originally created, with a restored image of
the same.
2021-12-21 17:50:06 +02:00
Kirill Bulatov
673c297949 Download timelines on demand 2021-12-10 17:23:35 +02:00
Dmitry Rodionov
557e3024cd Forward pageserver connection string from compute to safekeeper
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
2021-12-06 21:28:49 +03:00
Arseny Sher
cba4da3f4d Add term history to safekeepers.
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).

Full history of the proposer is transferred in separate message before proposer
starts streaming; it is immediately persisted by safekeeper, though he might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.

We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.

Per https://github.com/zenithdb/rfcs/pull/12

Closes #296.

Bumps vendor/postgres.
2021-12-03 12:43:57 +03:00
Dmitry Rodionov
130184fee9 Prohibit branch creation and basebackup at out of scope lsns
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.

To get there there was also two cleanups:
* Fix error handling in Execute message handler. This fixes behaviour
  when basebackup retured an error. Previously pageserver thread just
  died.
* Remove "ancestor" file which previously contained ancestor id and
  branch lsn. Currently the same data can be obtained from metadata file.
  And just the way we handled ancestor file in the code introduced the
  case when branching fails timeline directory is created but there is no data in it
  except ancestor file. And this confused gc because it scans
  directories. So it is better to just remove ancestor file and clean up
  this timeline directory creation so it happens after all validity
  checks have passed
2021-11-25 15:27:16 +03:00
Dmitry Rodionov
737a557f09 add check to python tests that afteer gc number of rows is unchanged in all branches 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
44111e3ba3 Prohibit branch creation at lsn that was already garbage collected.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
Also this adds a check to get_page_at_lsn which asserts that lsn at
which the page is requested was not garbage collected. This check
currently is triggered for readonly nodes which are pinned to specific
lsn and because they are not tracked in pageserver garbage collection
can remove data that still might be referenced. This is a bug and will
be fixed separately.
2021-11-15 20:03:16 +03:00
Heikki Linnakangas
849ac791a6 Bandaid fix for "page not found" errors, when a table is loaded.
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.

It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where cross segment boundary.

Fixes https://github.com/zenithdb/zenith/issues/841
2021-11-12 18:47:39 +02:00
Arthur Petukhovsky
9aaa02bc9a Fix high CPU usage in walproposer (#860)
* Bump vendor/postgres

* Update time limits for test_restarts_under_load
2021-11-10 17:18:07 +03:00
Egor Suvorov
587935ebed Add Safekeeper metrics tests (#746)
* zenith_fixtures.py: add SafekeeperHttpClient.get_metrics()
* Ensure that `collect_lsn` and `flush_lsn`'s reported values look reasonable in `test_many_timelines`
2021-11-09 22:18:59 +03:00
Arthur Petukhovsky
d19263aec8 Adjust timeouts for test_restarts_under_load (#830)
* Adjust timeouts for test_restarts_under_load

* Add test timeout for test_restarts_under_load
2021-11-04 19:58:40 +03:00
Heikki Linnakangas
66ec135676 Refactor pytest fixtures
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication, and no safekeepers. Tests that
want to use safekeepers or authentication set up a custom test-specific
ZenithEnv fixture.

Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
2021-10-25 14:14:47 +03:00
Heikki Linnakangas
28af3e5008 Remove some unnecessary fixture arguments 2021-10-25 14:14:45 +03:00
Heikki Linnakangas
57ce541521 Remove unnecessary 'pg_bin' object from 'postgres' fixture.
It was only used in check_restored_datadir_content(), and that function
can construct it easily from the other information it has.
2021-10-25 14:14:41 +03:00
Egor Suvorov
ff563ff080 test_runner: fix mypy errors and force it on CI (#774)
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
2021-10-21 13:51:54 +03:00
anastasia
7f9d2a7d05 Change 'zenith tenant list' API to return tenant state added in 0dc7a3fc 2021-10-21 11:04:22 +03:00
Arthur Petukhovsky
13f4e173c9 Wait for safekeepers to catch up in test_restarts_under_load (#776) 2021-10-20 14:42:53 +03:00
Egor Suvorov
eb706bc9f4 Force yapf (Python code formatter) in CI (#772)
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
2021-10-19 20:13:47 +03:00
Heikki Linnakangas
feae7f39c1 Support read-only nodes
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').

The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.

If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.

We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
2021-10-19 09:48:12 +03:00
Heikki Linnakangas
c2b468c958 Separate node name from the branch name in ComputeControlPlane
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same brach, so we need an identifier
for each node, separate from the branch name.
2021-10-19 09:48:10 +03:00
Arseny Sher
de744a44dd Add /timeline http request to safekeeper returning its status.
Which is mainly generational state (terms) and useful LSNs.

Also add /status basic healthcheck request which is now used in tests to
determine the safekeeper is up; this fixes #726.

ref #115
2021-10-14 19:02:38 +03:00
Arthur Petukhovsky
4b87acb1f6 Use logging in python tests (#674)
* Use logging in python tests

* Use f-strings for logs

* Don't log test output while running

* Use only pytest logging handler

* Add more info about pytest logging
2021-10-14 13:10:09 +03:00
Heikki Linnakangas
e3945d94fd Store unlogged tables locally, and replace PD_WAL_LOGGED.
All the changes are in the vendor/postgres side. However, because we now
generate fewer Full Page Writes, the 'branch_behind' test needs to be
modified so that it still generates enough WAL to consume a few WAL
segments.
2021-10-06 10:58:15 +03:00
Arseny Sher
4256231eb7 Enable test_start_compute with safekeepers.
It should work now.
2021-10-04 16:50:46 +03:00
Arthur Petukhovsky
d6fc74a412 Various fixes for test_sync_safekeepers (#668)
* Send ProposerGreeting manually in tests

* Move test_sync_safekeepers to test_wal_acceptor.py

* Capture test_sync_safekeepers output

* Add comment for handle_json_ctrl

* Save captured output in CI
2021-09-28 19:25:05 +03:00
Arseny Sher
7a370394a7 Wait till previous victim recovers in run_restarts_under_load.
Fixes test flakiness, as recovery easily might take the whole iteration.
2021-09-28 19:15:41 +03:00
Arseny Sher
70b08923ed Disable new safekeepers tests as not stable enough. 2021-09-26 22:33:58 +03:00
Heikki Linnakangas
ff5cbe2694 Support overlapping and nested Layers in the layer map.
This introduces a new tree data structure for holding intervals, and
queries of the form "which intervals contain the given point?". It then
uses that to store the Layers in the layer map, instead of the BTreeMap.

While we don't currently create overlapping layers in the page server,
that situation might arise in the future if we start to create extra
layers for performance purposes, or as part of some multi-stage
garbage collection operation that creates new layers in some interval
and then removes old ones. The situation might also arise if you have
multiple page servers running on the same timeline, freezing layers at
different points, and both uploading them to S3.

So even though overlapping layers might not happen currently, let's
avoid getting confused if it does happen for some reason.

Fixes https://github.com/zenithdb/zenith/issues/517.
2021-09-24 14:10:52 +03:00
Heikki Linnakangas
2319e0ec8f Define a layer's start and end bounds more precisely.
After this, a layer's start bound is always defined to be inclusive, and
end bound exclusive.

For example, if you have a layer in the range 100-200, that layer can be
used for GetPage@LSN requests at LSN 100, 199, or anything in between.
But for LSN 200, you need to look at the next layer (if one exists).

This is one part of a fix for https://github.com/zenithdb/zenith/issues/517.
After this, the page server shouldn't create layers for the same segment
with the same LSN, which avoids the issue. However, the same thing would
still happen, if you managed to create layers with same start LSN again.
That could happen e.g. if you had two page servers running, or in some
weird crash/restart scenario, or due to bugs or features added later. The
next commit makes the layer map more robust, so that it tolerates that
situation without deleting wrong files.
2021-09-24 14:10:49 +03:00
Arthur Petukhovsky
d4e037f1e7 Support for --sync-safekeepers in tests (#647)
New command has been added to append specially crafted records in safekeeper WAL. This command takes json for append, encodes LogicalMessage based on json fields, and processes new AppendRequest to append and commit WAL in safekeeper.

Python test starts up walkeepers and creates config for walproposer, then appends WAL and checks --sync-safekeepers works without errors. This test is simplest one, more useful test cases (like in #545) for different setups will be added soon.
2021-09-24 13:19:59 +03:00
anastasia
a4fc6da57b Fix gc_internal to treat dropped layers.
Some dropped layers serve as tombstones for earlier layers and thus cannot be garbage collected.
Add new fields to GcResult for layers that are preserved as tombstones
2021-09-23 12:21:47 +03:00
Arthur Petukhovsky
8ebf2fe550 Add test for acceptor restarts under load (#591)
In this test safekeepers are restarted one by one, while bank transactions
are executed and validated in the background. Bank transactions consist of
balance transfers and log writes. In the end balance sum should remain the
same and there should be progress from every client, when 2 of 3 safekeeper
nodes are up.
2021-09-22 11:59:20 +03:00
Heikki Linnakangas
540973eac4 Don't get confused on request of latest page version with very old LSN.
If the 'latest' flag in the client request is true, the client wants the
latest page version regardless of the LSN in the request. The LSN is just
a hint in that case, indicating that the page hasn't been modified since
since that LSN. The LSN can be very old, so it's possible that the page
server has already garbage collected away the layer at that LSN. We tried
to fetch the old layer and errored out if that happened. To fix, always
fetch the data as of last-record-LSN, if 'latest' is set in the client
request. We now only use the LSN to wait if the requested LSN hasn't been
received and processed yet.

Fixes https://github.com/zenithdb/zenith/issues/567
2021-09-17 18:56:05 +03:00
Dmitry Rodionov
9563336d9a Bring back check for interferring processes, add more comments and
descriptive errors
2021-09-15 14:02:15 +03:00
Dmitry Rodionov
4ebe643d0c Support parallel test running for python tests
Support is done via pytest-xdist plugin.
To use the feature add -n<concurrency> to pytest invocation
e.g. pytest -n8 to run 8 tests in parallel.

Changes in code are mostly about ports assigning. Previously port for
pageserver was hardcoded without the ability to override through zenith
cli and ports for started compute nodes were calculated twice, in zenith
cli and in test code. Now zenith cli supports port arguments for
pageserver and compute nodes to be passed explicitly.

Tests are modified in such a way that each worker gets a non overlapping
port range which can be configured and now contains 100 ports. These
ports are distributed to test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
2021-09-15 14:02:15 +03:00
Dmitry Rodionov
b4ecae33e4 add incremental tracking of logical timeline size
In order to exclude problems with synchronizing disk and memory logical
size is not stored in metadata on disk. It is calculated on timeline
"start" by scanning the contents of layered repo and then size is maintained
via an atomic variable.

This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
2021-09-07 18:25:15 +03:00
anastasia
94c50e3e90 Fix check_restored_datadir_content(). Call 'basebackup' command directly, instead of relying on CLI 2021-09-07 16:58:21 +03:00
anastasia
1e172230ce Add test funciton to compare files in compute nodes to catch bugs in SLRU replay.
Compare files in existing compute node's pgdata with fresh basebackup at the same lsn. We expect that content is identical, except tmp files
Use it after some tests.
2021-09-06 18:21:23 +03:00
Stas Kelvich
ed4eed0a19 Make use of postgres --sync-safekeepers in tests and CLI.
Change control plane code to call `postgres --sync-safekeepers` before
compute node start when safekeepers are enabled. Now `pg create` will
create an empty data directory with the proper config file. Subsequent
`pg start` will run `sync-safekeepers` and will call basebackup with
the resulting LSN. Also change few tests to accommodate this new behavior.
2021-09-06 13:06:20 +03:00
Konstantin Knizhnik
b227c63edf Set proper xl_prev in basebackup, when possible.
In a passing fix two minor issues with basabackup:
* check that we can't create branches with pre-initdb LSN's
* normalize branch LSN's that are pointing to the segment boundary

patch by @knizhnik
closes #506
2021-09-03 14:58:59 +03:00
anastasia
fabf5ec664 Don't use term 'snapshot' to describe layers 2021-09-03 11:00:38 +03:00
Stas Kelvich
ddd2c83c64 Change test_restart_compute to expose safekeeper problems.
Make this test look like 'test_compute_restart.sh' by @ololobus, which
was surprisingly good for checking safekeepers behavior. This test adds
an intermediate compute node start with bulk select that causes a lot of
FPI's and select itself wouldn't wait for all that WAL to be replicated.
So if we kill compute node right after that we end up with lagging safekeepers
with VCL != flush_lsn. And starting new node from that state takes special
care.

Also, run and print `pg_controldata` output after each compute node start
to eyeball lsn/checkpoint info of basebackup.

This commit only adds test without fixing the problem.
2021-09-02 12:06:12 +03:00
anastasia
27442c3daa Add test for DROP DATABASE command 2021-08-30 17:29:29 +03:00
Heikki Linnakangas
5998744bcc Remove rocksdb implementation.
The layered storage format is good enough that we don't need the rocksdb
implementation anymore. There are a lot of known issues but we'll keep
working on them.
2021-08-25 18:37:22 +03:00
Dmitry Rodionov
23b5249512 translate pageserver api to http 2021-08-24 19:05:00 +03:00