Commit Graph

909 Commits

Author SHA1 Message Date
Konstantin Knizhnik
beaa2cd0a2 Handle COPY error 2021-08-26 13:53:10 +03:00
Arseny Sher
c4450907e5 Don't hide exact error of get_timeline.
ref #470
2021-08-25 20:46:31 +03:00
Heikki Linnakangas
de9d5e0aa4 Remove unnecessary dependencies.
Found by "cargo udeps"
2021-08-25 18:51:15 +03:00
Heikki Linnakangas
4046530160 Remove remnants of choosing between repository formats.
Now that we only have one Repository implementation, no need for the
command-line options to choose it either. I'm removing these as a separate
commit to show what we will need to do if we add another Repository
implementation in the future (even though I don't foresee us doing that
any time soon)
2021-08-25 18:37:22 +03:00
Heikki Linnakangas
5998744bcc Remove rocksdb implementation.
The layered storage format is good enough that we don't need the rocksdb
implementation anymore. There are a lot of known issues but we'll keep
working on them.
2021-08-25 18:37:22 +03:00
Heikki Linnakangas
250ae643a8 Remove 'zenith push' feature.
Now that the new storage format is based on immutable files, we want to
implement push/pull in terms of these immutable files as well. Similarly
to how those files will be transferred between S3 and the page server.
The implementation we had was fairly tightly coupled with the object
repository implementation, but I'm about to remove the object / rocksdb
storage format soon. That would leave the current "zenith push" command
completely broken.

It seemed like a good idea at the time, but in hindsight, it was premature
to implement push/pull yet. It's a nice feature and I'd like to see it
reimplemented in the future, but in the meanwhile, let's remove the code
we had. We can dig the parts of it that might be useful in the future
from the git history.
2021-08-25 18:37:22 +03:00
Dmitry Ivanov
3edad463fb Adjust docker container for console's CI pipeline 2021-08-25 17:28:42 +03:00
Heikki Linnakangas
19fcea99da If too much memory is being used for in-memory layers, flush oldest one.
The old policy was to flush all in-memory layers to disk every 10 seconds.
That was a pretty dumb policy, unnecessarily aggressive. This commit
changes the policy so that we only flush layers where the oldest WAL
record is older than 16 MB from the last valid LSN on the timeline. That's
still pretty aggressive, but it's a step in the right direction. We do
need a limit on how old the oldest in-memory layer is allowed to be,
because that determines how much WAL the safekeepers need to hold onto,
and how much WAL we need to reprocess in case of a page server crash.
16 MB is surely still too aggressive for that, but it's easy to change
the setting later.

To support that, keep all in-memory layers in a binary heap, so that we
can easily find the one with the oldest LSN.

This tracks and a new LSN value in the metadata file: 'disk_consistent_lsn'.
Before, on page server restart we restarted the WAL processing from the
'last_record_lsn' value, but now that we don't flush everything to disk in
one go, the 'last_record_lsn' tracked in memory is usually ahead of the
last record that's been flushed to disk. Even though we track that oldest
LSN now, the crash recovery story isn't really complete. We don't do
fsync()s anywhere, and thing will break if a snapshot file isn't complete,
as there's no CRC on them. That's not new, and it's a TODO.
2021-08-25 11:20:47 +03:00
Dmitry Rodionov
f2f02a8af0 apply transformation (Arc<Option> -> Option<Arc>) suggested by @funbringer 2021-08-24 19:05:00 +03:00
Dmitry Rodionov
b135723994 review adjustments 2021-08-24 19:05:00 +03:00
Dmitry Rodionov
23b5249512 translate pageserver api to http 2021-08-24 19:05:00 +03:00
Eric Seppanen
41fa02f82b Replace transmute with serde
Upgrade to bindgen 0.59, which has two new abilities:
- specify arbitrary #[derive] attributes to attach to generated structs
- request explicit padding fields

These two features are enough to replace transmute with serde/bincode.
2021-08-24 16:32:37 +03:00
Heikki Linnakangas
81dd4bc41e Fix decoding XLOG_HEAP_DELETE and XLOG_HEAP_UPDATE records.
Because the t_cid field was missing from the XlHeapDelete struct that
corresponds to the PostgreSQL xl_heap_delete struct, the check for the
XLH_DELETE_ALL_VISIBLE_CLEARED flag did not work correctly.

Decoding XlHeapUpdate struct was also missing the t_cid field, but that
didn't cause any immediate problems because in that struct, the t_cid
field is after all the fields that the page server cares about. But fix
that too, as it was an accident waiting to happen.

The bug was mostly hidden by the VM page handling in zenith_wallog_page,
where it forcibly generates a FPW record whenever a VM page is evicted:

    else if (forknum == VISIBILITYMAP_FORKNUM && !RecoveryInProgress())
    {
        /*
         * Always WAL-log vm.
         * We should never miss clearing visibility map bits.
         *
         * TODO Is it too bad for performance?
         * Hopefully we do not evict actively used vm too often.
         */
        XLogRecPtr recptr;
        recptr = log_newpage_copy(&reln->smgr_rnode.node, forknum, blocknum, buffer, false);
        XLogFlush(recptr);
        lsn = recptr;

But that was just hiding the issue: it's still visible if you had a
read-only node relying on the data in the page server, or you killed and
restarted the primary node, or you started a branch. In the included test
case, I used a new branch to expose this.

Fixes https://github.com/zenithdb/zenith/issues/461
2021-08-24 15:59:25 +03:00
anastasia
ad8b5c3845 use updated vendor/postgres 2021-08-23 18:19:59 +03:00
Dmitry Rodionov
dcaa2126f1 fix code format after main rebase 2021-08-23 18:01:59 +03:00
Dmitry Rodionov
b29ca232d6 add ability to disable colors, use argparse for arguments 2021-08-23 17:28:45 +03:00
Dmitry Rodionov
8c62b11bd5 adjust for review 2021-08-23 17:28:45 +03:00
Dmitry Rodionov
35b60d509f Add support for code format checking using rustfmt in optional
pre-commit hook and in ci pipeline. Found issues can be fixed
automatically via make fmt.
2021-08-23 17:28:45 +03:00
Dmitry Rodionov
d989580c1c remove small code duplication involving InMemoryLayer::get_seg_size, and remove redundant Option around new snapshot layer in InMemoryLayer::freeze 2021-08-23 13:00:05 +03:00
anastasia
798160544c Update zenith readmes:
- Move source tree overview into separate docs/sourcetree.md and update it.
- Add glossary: docs/glossary.md
- Add a draft of Architecture overview to main Readme.md
2021-08-23 10:21:10 +03:00
Max Sharnoff
39bb6fb19c Marginally improve walkeeper error visibility (#440)
Adds a warning if a postgres query fails, and some additional context to
errors generated inside `ReceiveWalConn::run`
2021-08-19 08:46:18 -07:00
Dmitry Rodionov
82725725fd update README to match required Rust version and new python package installation process 2021-08-19 17:42:52 +03:00
Alexey Kondratov
1c3d51ed92 Add Docker images building doc and refactor the overall docs reference 2021-08-19 15:12:35 +03:00
Alexey Kondratov
04a309f562 Build zenithdb/zenith:latest in CI (zenithdb/console#18) 2021-08-19 15:12:35 +03:00
anastasia
20e6cd7724 Update test_twophase - check that we correctly restore files at compute node start. 2021-08-19 12:15:09 +03:00
Heikki Linnakangas
9fed5c8fb7 Add test for page server restart. 2021-08-18 20:19:07 +03:00
Dmitry Rodionov
4bce65ff9a bump rust version in ci to 1.52.1 2021-08-17 20:31:28 +03:00
Heikki Linnakangas
3319befc30 Revert a bunch of commits that I pushed by accident
This reverts commits:
  e35a5aa550
  a389c2ed7f
  11ebcb531f
  8d2b61f4d1
  882f549236
  ddb7155bbe

Those were follow-up work on top of PR
https://github.com/zenithdb/zenith/pull/430, but they were still very
much not ready.
2021-08-17 19:20:27 +03:00
Heikki Linnakangas
ddb7155bbe WIP Store base images in separate ImageLayers 2021-08-17 18:55:04 +03:00
Heikki Linnakangas
882f549236 WIP: store base images separately 2021-08-17 18:54:53 +03:00
Heikki Linnakangas
8d2b61f4d1 Move code to handle snapshot filenames 2021-08-17 18:54:53 +03:00
Heikki Linnakangas
11ebcb531f Add Gauge for # of layers 2021-08-17 18:54:53 +03:00
Heikki Linnakangas
a389c2ed7f WIP: Track oldest open layer 2021-08-17 18:54:53 +03:00
Heikki Linnakangas
e35a5aa550 WIP: track mem usage 2021-08-17 18:54:53 +03:00
Heikki Linnakangas
45f641cabb Handle last "open" layer specially in LayerMap.
There can be only one "open" layer for each segment. That's the last one,
implemented by InMemoryLayer. That's the only one where new records can
be appended to. Much of the code needed to distinguish between the last
open layer and other layers anyway, so make the distinction explicit
in LayerMap.
2021-08-17 18:54:51 +03:00
Heikki Linnakangas
48f4a7b886 Refactor get_page_at_lsn() logic to layered_repository.rs
There was a a lot of duplicated code between the get_page_at_lsn()
implementations in InMemoryLayer and SnapshotLayer. Move the code for
requesting WAL redo from the Layer trait into LayeredTimeline. The
get-function in Layer now just returns the WAL records and base image
to the caller, and the caller is responsible for performing the WAL
redo on them.
2021-08-17 18:54:48 +03:00
Heikki Linnakangas
91f72fabc9 Work with smaller segments.
Split each relish into fixed-sized 10 MB segments. Separate layers are
created for each segment. This reduces the write amplification if you
have a large relation and update only parts of it; the downside is
that you have a lot more files. The 10 MB is just a guess, we should
do some modeling and testing in the future to figure out the optimal
size.

Each segment tracks the size of the segment separately. To figure out
the total size of a relish, you need to loop through the segment to
find the highest segment that's in use. That's a bit inefficient, but
will do for now. We might want to add a cache or something later.
2021-08-17 18:54:41 +03:00
anastasia
cbeb67067c Issue #367.
Change CLI so that we always create node from scratch at 'pg start'.
This operation preserve previously existing config

Add new flag '--config-only' to 'pg create'.
If this flag is passed, don't perform basebackup, just fill initial postgresql.conf for the node.
2021-08-17 18:12:31 +03:00
anastasia
921ec390bc cargo fmt 2021-08-16 19:41:07 +03:00
Heikki Linnakangas
f37cb21305 Update Cargo.lock for addition of 'bincode'
Commit 5eb1738e8b added a dependency to the 'bincode' crate. 'cargo build'
adds it to Cargo.lock automatically, so let's remember it.
2021-08-16 19:24:26 +03:00
Heikki Linnakangas
7ee8de3725 Add metrics to WAL redo.
Track the time spent on replaying WAL records by the special Postgres
process, the time spent waiting for acces to the Postgres process (since
there is only one per tenant), and the number of records replayed.
2021-08-16 15:49:17 +03:00
Heikki Linnakangas
047a05efb2 Minor formatting and comment fixes. 2021-08-16 15:48:59 +03:00
Dmitry Rodionov
0c4ab80eac try to be more intelligent in WalAcceptor.start, added a bunch of typing sugar to wal acceptor fixtures 2021-08-16 14:27:44 +03:00
Heikki Linnakangas
2450f82de5 Introduce a new "layered" repository implementation.
This replaces the RocksDB based implementation with an approach using
"snapshot files" on disk, and in-memory btreemaps to hold the recent
changes.

This make the repository implementation a configuration option. You can
choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>"
The unit tests have been refactored to exercise both implementations.
'layered' is now the default.

Push/pull is not implemented. The 'test_history_inmemory' test has been
commented out accordingly. It's not clear how we will implement that
functionality; probably by copying the snapshot files directly.
2021-08-16 10:06:48 +03:00
Max Sharnoff
5eb1738e8b Rework walkeeper protocol to use libpq (#366)
Most of the work here was done on the postgres side. There's more
information in the commit message there.
 (see: 04cfa326a5)

On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to
initialize the WAL keeper protocol. Everything else is mostly the same,
with the only real difference being that protocol messages are now
discrete CopyData messages sent over the postgres protocol.

For the sake of documentation, the full set of these messages is:

  <- recv: START_WAL_PUSH query
  <- recv: server info from postgres   (type `ServerInfo`)
  -> send: walkeeper info              (type `SafeKeeperInfo`)
  <- recv: vote info                   (type `RequestVote`)

  if node id mismatch:
    -> send: self node id (type `NodeId`); exit

  -> send: confirm vote (with node id) (type `NodeId`)

  loop:
    <- recv: info and maybe WAL block  (type `SafeKeeperRequest` + bytes)
         (break loop if done)
    -> send: confirm receipt           (type `SafeKeeperResponse`)
2021-08-13 11:25:16 -07:00
Heikki Linnakangas
6e22a8f709 Refactor WAL redo to not use a separate thread.
My main motivation is to make it easier to attribute time spent in WAL
redo to the request that needed the WAL redo. With this patch, the WAL
redo is performed by the requester thread, so it shows up in stack traces
and in 'perf' report as part of the requester's call stack. This is also
slightly simpler (less lines of code) and should be a bit faster too.
2021-08-13 17:23:36 +03:00
Heikki Linnakangas
f8de71eab0 Update vendor/postgres to fix race condition leading to CRC errors.
Fixes https://github.com/zenithdb/zenith/issues/413
2021-08-13 14:02:26 +03:00
Heikki Linnakangas
8517d9696d Move gc_iteration() function to Repository trait.
The upcoming layered storage implementation handles GC as a
repository-wide operation because it needs to pay attention to the branch
points of all timelines.
2021-08-12 23:46:01 +03:00
Heikki Linnakangas
97f9021c88 Fix JWT token encoding issue in test.
On my laptop, the server was receiving the token as a string with extra
b'...' escaping, e.g as "b'eyJ0....0ifQA'" instead of just "eyJ0....0ifQA".
That was causing the test to fail.

I'm using Python 3.9, while the CI is using Python 3.8. I suspect that's
why. My version of pyjwt might be different too.

See also https://github.com/jpadilla/pyjwt/issues/391.
2021-08-12 20:46:14 +03:00
Heikki Linnakangas
0a92b31496 If a pg_regress test fails in CI, save regression.diffs 2021-08-12 18:39:23 +03:00