by binding sockets before daemonization
also use less annoying error reporting by not printing full error
messages for connect errors in first several connection retries
closes#507
In order to exclude problems with synchronizing disk and memory logical
size is not stored in metadata on disk. It is calculated on timeline
"start" by scanning the contents of layered repo and then size is maintained
via an atomic variable.
This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
- Reorder the structs and functions
- Delegate many of the operations in LayerMap to SegEntry. For example,
`LayerMap::insert_open` now looks up the right SegEntry struct, and
then calls `SegEntry::insert_open` on it.
- Use HashMap::entry() function with or_default() to implement the lookups
with less code
Commit 66929ad6fb added a 'generation' number to open segments stored
in the layer map, to distinguish old layers from layers that were
added to the map during checkpoint processing. But it neglected the
OpenSegEntry::cmp() function.
It seems that the cmp() function is never used by BinaryHeap, so this
didn't cause any user-visible bugs (I tried adding a panic() to the
cmp() function and it didn't fire). But it's clearly wrong and we need
to fix it, anyway.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439
In a passing fix two minor issues with basabackup:
* check that we can't create branches with pre-initdb LSN's
* normalize branch LSN's that are pointing to the segment boundary
patch by @knizhnik
closes#506
This should fix the sporadic regression test failures we've been seeing
lately with "no base img found" errors.
This fixes the common case, but one corner case is still not handled:
If a relation is extended across a segment boundary, leaving a gap block
in the segment preceding the segment containing the target block, the
preceding segment will not be padded with zeros correctly. This adds
a test case for that, but it's commented out.
See github issue https://github.com/zenithdb/zenith/issues/500
To fix, break out of the loop when you reach an in-memory layer that was
created after the checkpoint started. To do that, add a "generation"
counter into the layer map.
Fixes https://github.com/zenithdb/zenith/issues/494
Previously, the InMemoryLayer and DeltaLayer implementations of
get_page_reconstruct_data would recursively call the predecessor layer's
get_page_reconstruct_data function. Refactor so that we iterate in the
caller instead. Make get_page_reconstruct_data() return the predecessor
layer along with the continuation LSN, so that the caller can iterate.
IMO this makes the logic more clear, although this is more lines of code.
DeltaLayer uses the name `predecessor` for the same thing. Use the
same name in InMemoryLayer. The 'img_layer' name was misleading, as
the predecessor layer is not necessarily an image layer. Currently,
the 'freeze' function always creates a new image layer, but it
wouldn't have to be that way. Also, when you create a new branch, at
the branch point the predecessor layer can be a delta layer on the
ancestor branch.
* add lsn argument
* do not expose wait_lsn, wait inside list_nonrels()
* fix parameters parsing
* expose get_last_record_rlsn() to atomically read (last,prev) pair
More work is needed to correctly handle basebackup@old_lsn but current
approach already allows to fix test_restart_compute
There are two main reasons for that:
a) Latest unfinished record may disapper after compute node restart, so let's
try not leak volatile part of the WAL into the repository. Always use
last_valid_record instead.
That change requires different getPage@LSN logic in postgres -- we need
to ask LSN's that point to some complete record instead of GetFlushRecPtr()
that can point in the middle of the record. That was already done by @knizhnik
to deal with the same problem during the work on `postgres --sync-safekeepers`.
Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we
also need to be sure that last_valid_record is aligned.
b) Switch to get_last_record_lsn() in basebackup@no_lsn. When compute node
is running without safekeepers and streams WAL directly
to pageserver it is important to match basebackup LSN and LSN of replication
start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn
and walreceiver started replication with last_record_lsn, which can be less.
So replication was failing since compute node doesn't have requested WAL.
I noticed that the timeline directory contained files like this:
pg_xact_0000_0_000000000169C3C2_00000000016BB399
pg_xact_0000_0_00000000016BB399
pg_xact_0000_0_00000000016BB399_00000000016BDD06
pg_xact_0000_0_00000000016BDD06
pg_xact_0000_0_00000000016BDD06_00000000016C63AA
pg_xact_0000_0_00000000016C63AA
pg_xact_0000_0_00000000016C63AA_0000000001765226_DROPPED
pg_xact_0000_0_0000000001765226
pg_xact_0001_0_00000000016BB77E_00000000016BDD06
pg_xact_0001_0_00000000016BDD06
pg_xact_0001_0_00000000016BDD06_0000000001765226_DROPPED
pg_xact_0001_0_0000000001765226
Note how there is an image file after each DROPPED file. It's a waste of
time and space to materialize an image of the file at the point where it's
dropped, no one is going to request pages on a dropped relation. And it's
a correctness issue too: list_rels() and list_nonrels() will not consider
the relation as unlinked, unless the latest layer indicates so, and there
is no concept of a dropped image layer. That was causing test_clog_truncate
test to fail, when I adjusted the checkpointer to force a checkpoint more
aggressively.
There are a bunch more issues related to dropped rels and branching,
see https://github.com/zenithdb/zenith/issues/502. Hence this doesn't
completely fix the issue I saw with test_clog_truncate either. But it's
a start.
The comment talked about the WAL redo thread, but commit 6e22a8f709
refactored that away. The problem the comment describes probably still
exists, so keep the comment, but update the wording.
Avoid slurping entire image files into memory.
For blocky segments, we write the bytes directly to a bookfile chapter.
The blocks are a fixed size, which allows for random access.
split the page versions into two chapters:
PAGE_VERSION_METAS - a rust BTreeMap from (block #, lsn) -> page & WAL
byte ranges in PAGE_VERSIONS_CHAPTER
PAGE_VERSIONS_CHAPTER - raw page images and serialized WAL records
Once upon a time, 'page_cache.rs' contained an actual page cache, but
it hasn't for a very long time. Rename to reflect what it actually does
these days.
Previously, a SnapshotLayer and corresponding file on disk contained the
base image of every page in the segment at the start LSN, and all the
changes (= WAL records) in the range between start and end LSN. That was
a bit awkward, because we had to keep the base image of every page in
memory until we had accumulated enough WAL after the base image to write
out the layer. When it's time to write out a layer, we would really want
to replay the WAL to reconstruct the most recent version of each page, to
save the effort later. That's on the assumption that the client will
usually request the most recent version, not some older one.
Split the SnapshotLayer into two structs: ImageLayer and DeltaLayer. An
image layer contains a "snapshot" of the segment at one specific LSN, and
no WAL records, whereas a delta layer contains WAL records in a range of
LSNs. In order to reconstruct a page version in the delta layer, by
performing WAL redo, you also need the previous image layer. So the delta
layers are "incremental" against the previous layer.
So where previously we would create snapshot files like this:
rel_100_200
rel_200_300
rel_300_400
We now create image and delta files like this:
rel_100 # image
rel_100_200 # delta
rel_200
rel_200_300
rel_300
rel_300_400
rel_400
That's more files, but as discussed above, this allows storing more
up-to-date page versions on disk, which should reduce the latency of
responding to a GetPage request. It also allows more fine-grained garbage
collection. In the above example, after the old page version are no longer
needed and if the relation is not modified anymore, we only need to keep
the latest image file, 'rel_400', and everything else can be removed.
Implements https://github.com/zenithdb/zenith/issues/339