Commit Graph

537 Commits

Author SHA1 Message Date
anastasia
90e5b6f983 Don't try to reconnect failed walreceiver. If necessary, wal service will send new callmemaybe request 2021-12-20 12:34:28 +03:00
Heikki Linnakangas
75cbaafb96 Remove old ephemeral files on pageserver restart.
The ephemeral files are not usable after restart, so just delete them.
Before this, you got "unrecognized filename in timeline dir" warnings
about them, as Konstantin noted at:
https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870.

While we're at it, refactor away the list_files() function, moving the
logic fully into the caller. Seems more straightforward.
2021-12-17 00:00:02 +02:00
Heikki Linnakangas
d8a367dd32 Remove dead code, fix typos. 2021-12-15 19:58:03 +02:00
Kirill Bulatov
ca60561a01 Propagate disk consistent lsn in timeline sync statuses 2021-12-15 15:13:21 +02:00
Heikki Linnakangas
7f78e80c51 Refactor WAL ingestion code.
Rename save_decoded_record() to ingest_record(), and move the
responsibility for decoding the record into ingest_record().

Also move the responsibility of updating the CheckPoint relish to
ingest_record(). Put it in a new WalIngest struct, to help with tracking
that.
2021-12-14 20:24:03 +02:00
Heikki Linnakangas
f8f88154d5 Split restore_local_repo.rs into two files, with more descriptive names. 2021-12-14 20:24:03 +02:00
Kirill Bulatov
5cff7d1de9 Use proper download order 2021-12-14 15:32:22 +02:00
Heikki Linnakangas
e0d41ac6a3 Move constants related to metadata file to metadata.rs.
They're not used anywhere else, so seems like a better place.
2021-12-13 16:57:16 +02:00
Heikki Linnakangas
72ef59c378 Fix small typos in comments, add a comment.
The introducing paragraph README could use some more love, but let's at
least fix the typos.
2021-12-13 13:44:08 +02:00
Kirill Bulatov
673c297949 Download timelines on demand 2021-12-10 17:23:35 +02:00
Kirill Bulatov
e61732ca7c Compress checkpoint files before streaming into S3 2021-12-10 17:23:35 +02:00
Heikki Linnakangas
cb4a8396fb Use rustls rather than native-tls in all dependencies.
We depends on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
2021-12-10 15:14:27 +02:00
Heikki Linnakangas
c77e30116e Split waldecoder.rs into two source files.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.

This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.

(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
2021-12-10 15:14:13 +02:00
Heikki Linnakangas
9d369f158c Update rust-s3 to version 0.28.0
0.28.0 includes two changes I submitted to upstream:

- Add support for older ListObjects API, needed to use rust-s3 with Google
  Cloud Storage: https://github.com/durch/rust-s3/pull/229

- If file is smaller than one chunk, don't initiate multi-part upload.
  https://github.com/durch/rust-s3/pull/228

These are not critical for Zenith right now, but let's stay up-to-date.
2021-12-10 14:52:08 +02:00
Heikki Linnakangas
f3f059c1f8 Fix a few cases where request beyond end of rel would error out.
Currently, we return an all-zeros page if you request a block beyond end of
a relation. That has been implemented in LayeredTimeline::materialize_page,
so that if Layer::get_page_reconstruct_data returns Missing, it returns
and all-zeros page.

However InMemoryLayer and DeltaLayer would return Continue, not Missing,
in that case, and materialize_page would try to find the predecessor
layer. If there was a preceding image layer, then everything would still
work, but if there wasn't, it would return a "could not find predecessor
of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them
check the size of the relation and return Missing in that case.

This is hard to reproduce at the moment, but it happened quickly with
pgbench when I modified InMemoryLayer::write_to_disk so that it didn't
always create a new ImageLayer.
2021-12-09 17:46:48 +02:00
Dmitry Ivanov
7cec13d1df Improve shutdown story for code coverage
This patch introduces fixes for several problems affecting
LLVM-based code coverage:

* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.

* Implement proper shutdown handlers in safekeeper.
2021-12-06 13:27:52 +03:00
anastasia
c7f3b4e62c Clarify the meaning of StandbyReply LSNs:
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At pageserver it doesn't guarantees data persistence, but it's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
2021-12-06 12:49:42 +03:00
Heikki Linnakangas
5bad2deff8 Don't hold 'timelines' lock over checkpoint.
It was very noticeable that you while the checkpointer was busy, you
could not e.g. open a new connection.
2021-12-03 07:42:10 -05:00
Heikki Linnakangas
f49ad33f1b Initialize 'loaded' correctly in DeltaLayer.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it it. Alternatively, perhaps we should close it
on unload() to save some memory, but this I'm not going to think too
hard about it right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.

This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
2021-11-30 22:23:59 +02:00
Kirill Bulatov
670205e17a Evict excessively failing sync tasks, improve processing for the rest of
the tasks
2021-11-30 13:58:49 +02:00
Konstantin Knizhnik
f72d4814b1 Extract page images from FPI WAL records (#949)
* Extract page images from FPI WAL records

* Fix issues reported in review
2021-11-30 12:57:26 +03:00
Heikki Linnakangas
5ecf0664cc Fix off-by-one error in check for future delta layers.
This doesnt show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
2021-11-29 14:35:24 +02:00
Heikki Linnakangas
7cae265447 Fix dump_layerfile.
The VirtualFile machinery panics if it's not initialized
2021-11-29 11:26:54 +02:00
Heikki Linnakangas
5aa969a588 Replace in-memory layers and OOM-triggered eviction with temp files.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.

To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.

This includes changes from 'inmemory-layer-chunks' branch to serialize /
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
2021-11-26 17:25:17 +03:00
Dmitry Rodionov
130184fee9 Prohibit branch creation and basebackup at out of scope lsns
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.

To get there there was also two cleanups:
* Fix error handling in Execute message handler. This fixes behaviour
  when basebackup retured an error. Previously pageserver thread just
  died.
* Remove "ancestor" file which previously contained ancestor id and
  branch lsn. Currently the same data can be obtained from metadata file.
  And just the way we handled ancestor file in the code introduced the
  case when branching fails timeline directory is created but there is no data in it
  except ancestor file. And this confused gc because it scans
  directories. So it is better to just remove ancestor file and clean up
  this timeline directory creation so it happens after all validity
  checks have passed
2021-11-25 15:27:16 +03:00
Heikki Linnakangas
d47f610606 Fix pageserver CLI parameter names and document them 2021-11-25 13:31:52 +02:00
Dmitry Rodionov
0650e51f0b add test one more case for layer visibility 2021-11-22 11:39:20 +03:00
Dmitry Rodionov
6f7ebe6e01 preserve data in parent branch that might be referenced in child branch 2021-11-22 11:39:20 +03:00
Kirill Bulatov
b32da3b42e Use less pageserver-specific method in RemoteStorage trait 2021-11-18 22:53:40 +02:00
Heikki Linnakangas
f8702d4625 Fix checking for whether segment exists on a frozen in-memory layer.
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.

This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
2021-11-15 20:19:15 +02:00
Dmitry Rodionov
44111e3ba3 Prohibit branch creation at lsn that was already garbage collected.
This introduces new timeline field latest_gc_cutoff. It is updated
before each gc iteration. New check is added to branch_timelines to
prevent branch creation with start point less than latest_gc_cutoff.
Also this adds a check to get_page_at_lsn which asserts that lsn at
which the page is requested was not garbage collected. This check
currently is triggered for readonly nodes which are pinned to specific
lsn and because they are not tracked in pageserver garbage collection
can remove data that still might be referenced. This is a bug and will
be fixed separately.
2021-11-15 20:03:16 +03:00
Patrick Insinger
298bc588f9 pageserver - don't try to GC InMemoryLayers 2021-11-15 09:01:45 -08:00
Heikki Linnakangas
431d32756b Add a buffer cache, and use it to store materialized pages.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.

One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.

The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.

Co-authored-by: Patrick Insinger <patrick@zenith.tech>
2021-11-12 11:02:12 -08:00
Heikki Linnakangas
3d172d98a3 Improve layered repo README.
Add an informal overview of how it works.
2021-11-12 19:59:31 +02:00
Heikki Linnakangas
849ac791a6 Bandaid fix for "page not found" errors, when a table is loaded.
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.

It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where cross segment boundary.

Fixes https://github.com/zenithdb/zenith/issues/841
2021-11-12 18:47:39 +02:00
Heikki Linnakangas
d1e79c4af3 Fix locking issues in VirtualFile machinery.
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:

1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.

2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.

Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
2021-11-11 20:12:59 +02:00
Kirill Bulatov
abb2ac5246 Better context when erroring 2021-11-11 19:22:05 +02:00
Kirill Bulatov
99dbbe5f18 Allow downloading remote files partially 2021-11-11 18:51:34 +02:00
Arseny Sher
e7ca8ef5a8 Use PG timelineid 1 everywhere.
As changing it doesn't have useful meaning in Zenith.

ref #824
2021-11-11 13:53:39 +03:00
Patrick Insinger
1ce4976e36 pageserver - track size of VecMaps 2021-11-10 11:09:34 -08:00
Heikki Linnakangas
9300107cdf Cache Book objects, use virtual files to avoid running out of fds.
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
2021-11-10 17:19:37 +02:00
Heikki Linnakangas
6d742719a1 Fix infinite loop in looking up predecessor layer
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.

Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
2021-11-04 16:07:12 +02:00
Dmitry Rodionov
987833e0b9 Propagate git SHA to zenith binaries
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses git_version crate when git is
available, and GIT_VERSION environment variable otherwise which is the case for docker
builds.
2021-11-04 14:22:29 +03:00
Kirill Bulatov
f36acf00de Reduce "relish" word usages in remote storage 2021-11-04 12:53:42 +02:00
Kirill Bulatov
956fc3dec9 Tidy up and make consistent the remote storate API 2021-11-04 12:53:42 +02:00
Heikki Linnakangas
b38e841f2d Use poll() in communication with WAL redo process.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
2021-11-04 10:39:04 +02:00
Heikki Linnakangas
3a0111c75e Refactor functions for constructing WAL redo messages.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
2021-11-04 10:39:00 +02:00
Heikki Linnakangas
4ba783d0af Remove a couple of unused functions.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead.
2021-11-03 19:10:23 +02:00
Patrick Insinger
0457fe81a9 pageserver - make PageVersion an enum 2021-11-03 09:28:49 -07:00
Heikki Linnakangas
fb524dd973 Put a global limit on memory used by in-memory layers.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.

After storing a WAL record in the repository, the WAL receiver checks
if the global memory usage. If it's above a configurable threshold (hard
coded at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by GClock algorithm, similar to that used in the Postgres buffer
cache.

This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.

Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
2021-11-02 15:49:39 +02:00