If the 'latest' flag in the client request is true, the client wants the
latest page version regardless of the LSN in the request. The LSN is just
a hint in that case, indicating that the page hasn't been modified since
since that LSN. The LSN can be very old, so it's possible that the page
server has already garbage collected away the layer at that LSN. We tried
to fetch the old layer and errored out if that happened. To fix, always
fetch the data as of last-record-LSN, if 'latest' is set in the client
request. We now only use the LSN to wait if the requested LSN hasn't been
received and processed yet.
Fixes https://github.com/zenithdb/zenith/issues/567
- Use different message formats for different kinds of response messages.
- Add an Error message, for passing errors from page server to Postgres.
Previously, we would respond to 'exists' request with 'false', and
to 'nblocks' request with 0, if an error happened. Fix those to return
an error message to the client. GetPage requests had a mechanism to
return an error, but it was just a flag with no error message.
- Add a flag to requests, to indicate that we actually want the latest
page version on the timeline, and the LSN is just a hint that we know
that there haven't been any modifications since that LSN. The flag isn't
used for anything yet, but I'm planning to use it to fix
https://github.com/zenithdb/zenith/issues/567
Most of the previous usages of get_repository_for_tenant were followed
by immediately getting a timeline in that repository, without keeping it
around for longer.
The new `get_timeline_for_tenant` function implements that same
behavior, but in one line.
- Change hardcoded OLDEST_INMEM_DISTANCE value to pageserver config option checkpoint_distance.
- Get rid of 'force' flag in checkpoint_internal(). Use checkpoint_distance=0 instead.
Support is done via pytest-xdist plugin.
To use the feature add -n<concurrency> to pytest invocation
e.g. pytest -n8 to run 8 tests in parallel.
Changes in code are mostly about ports assigning. Previously port for
pageserver was hardcoded without the ability to override through zenith
cli and ports for started compute nodes were calculated twice, in zenith
cli and in test code. Now zenith cli supports port arguments for
pageserver and compute nodes to be passed explicitly.
Tests are modified in such a way that each worker gets a non overlapping
port range which can be configured and now contains 100 ports. These
ports are distributed to test services (pageserver, wal acceptors,
compute nodes) so they can work independently.
Data written to frozen layers is lost. It will not appear in on-disk
structures or in successor InMemoryLayers. Here we detect this race, and
fail. I think this race is rare, but this should make it easier to track
down when it happens.
Implement the changes suggested in a comment, create
`get_layer_for_read_locked` so that `get_layer_for_write` doesn't have
to drop the LayerMap lock when searching for the predecessor.
This contains a lowest common denominator of pageserver and safekeeper log
initialisation routines. It uses daemonize flag to decide where to
stream log messages. In case daemonize is true log messages are
forwarded to file. Otherwise streaming to stdout is used. Usage of
stdout for log output is the default in docker side of things, so make
it easier to browse our logs via builtin docker commands.
Otherwise we produce corrupted record holes in WAL during compute node restart
in case there was an unfinished record from the old compute, as these reports
advance commit_lsn -- reliably persisted part of WAL.
ref #549.
Mostly by @knizhnik. I adjusted to make sure proposer always starts streaming
since record beginning so we don't need special quirks for decoding in
safekeeper.
Ran into problems launching the WAL redo process on OS X after 4b73ad.
Launching the `initdb` process was met with "bad file descriptor" errors.
Using dtrace, I found shortly after calling `posix_spawn` for `initdb`,
`kevent` was returning this error.
I haven't dug super deep to see if the daemonization itself is the
problem, but this commit fixes it for me. My hunch is that some file
descriptors used when the Tokio runtime is initailzed become invalid
in the daemon process.
When you advance last record LSN, *all* changes up to that LSN should be
imported into repository. We have been a bit sloppy about that when it
comes to the checkpoint information that we also store in the repository.
In WAL receiver, for example, we would receive a WAL record, advance
last record LSN, and only then update the checkpoint relish at the same
LSN. Reorder that so that you advance the last record LSN only after
updating the checkpoint relish. It hasn't apparently caused any problems
so far, but let's be tidy.
Tighten the check for that in get_layer_for_write(), so that it checks for
'lsn > last_record_lsn' rather than 'lsn >= last_record_lsn'.
Always use lsn(0) as the initial last_record_lsn. It is updated soon after
creating the timeline anyway, after loading the bootstrap data, so it
doesn't stay long in that state. I was a bit worried about using a special
value like 0, but it's actually nice that you can distinguish it from any
real LSN value. The unit tests have been using Lsn(0) as the initial start
LSN all along.
Move the responsibility to wait for the WAL to arrive to the callers, and
remove the wait_lsn() calls from the Timeline::get_page_at_lsn() and
friends. We were not totally consistent before, list_rels() was missing the
wait_lsn() call for example.
Closes https://github.com/zenithdb/zenith/issues/521
by binding sockets before daemonization
also use less annoying error reporting by not printing full error
messages for connect errors in first several connection retries
closes#507
In order to exclude problems with synchronizing disk and memory logical
size is not stored in metadata on disk. It is calculated on timeline
"start" by scanning the contents of layered repo and then size is maintained
via an atomic variable.
This patch also adds new endpoint to pageserver http api: branch detail.
It allows retrieval of a particular branch info by its name. Size info
is also added to the response of the endpoint and used in tests.
- Reorder the structs and functions
- Delegate many of the operations in LayerMap to SegEntry. For example,
`LayerMap::insert_open` now looks up the right SegEntry struct, and
then calls `SegEntry::insert_open` on it.
- Use HashMap::entry() function with or_default() to implement the lookups
with less code
Commit 66929ad6fb added a 'generation' number to open segments stored
in the layer map, to distinguish old layers from layers that were
added to the map during checkpoint processing. But it neglected the
OpenSegEntry::cmp() function.
It seems that the cmp() function is never used by BinaryHeap, so this
didn't cause any user-visible bugs (I tried adding a panic() to the
cmp() function and it didn't fire). But it's clearly wrong and we need
to fix it, anyway.
1) Do epoch switch without record from new epoch, immediately after recovery --
--sync-safekeepers mode doesn't generate new records.
2) Fix commit_lsn advancement by taking into account wal we have locally --
setting it further is incorrect.
3) Report it back to walproposer so he knows when sync is done.
4) Remove system id check as it is unknown in sync mode.
And make logging slightly better.
ref #439