This patch adds attach/detach HTTP endpoints to the pageserver, changes
callmemaybe handling inside the safekeeper, and adds an integration
test that checks migration with and without load. There are still some
rough edges that will be addressed in follow-up patches.
Mainly because it has better support for installing packages across
different Python versions.
It also has a better dependency resolver than Pipenv, and it supports the
modern standard for Python dependency management, including the use of
pyproject.toml for project-specific configuration instead of per-tool
config files. See the following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
https://www.python.org/dev/peps/pep-0518/
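For illustration, a minimal hypothetical pyproject.toml under Poetry could
hold project metadata, dependencies, and per-tool settings in one file
(the section names follow Poetry's conventions; the values are made up):

    [tool.poetry]
    name = "myproject"
    version = "0.1.0"
    description = "Example project"

    [tool.poetry.dependencies]
    python = "^3.9"
    requests = "^2.26"

    # Per-tool configuration can live here too, e.g. pytest:
    [tool.pytest.ini_options]
    testpaths = ["tests"]

    [build-system]
    requires = ["poetry-core>=1.0.0"]
    build-backend = "poetry.core.masonry.api"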
* Do not delete layers beyond cutoff LSN
* Update pageserver/src/layered_repository/layer_map.rs
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
A thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations when shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036: previously, the libpq
listener thread would not exit until one more connection arrived.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHandle of each new thread to a vector, but old handles
for threads that had already exited were never removed.
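Schematically, the registry looks something like this (a simplified sketch
with made-up names, not the actual thread_mgr API):

    // Hypothetical sketch of a thread registry with per-thread shutdown
    // flags; the real thread_mgr module differs in detail.
    use std::collections::HashMap;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, LazyLock, Mutex};
    use std::thread::JoinHandle;

    struct RegisteredThread {
        shutdown_requested: Arc<AtomicBool>,
        handle: JoinHandle<()>,
    }

    // Global map of all registered threads, keyed by an arbitrary thread id.
    static THREADS: LazyLock<Mutex<HashMap<u64, RegisteredThread>>> =
        LazyLock::new(|| Mutex::new(HashMap::new()));

    /// Spawn a thread and register it; the closure receives the shutdown
    /// flag and is expected to poll it (or wrap it in a Future in async code).
    fn spawn_thread(id: u64, name: &str, body: impl FnOnce(Arc<AtomicBool>) + Send + 'static) {
        let flag = Arc::new(AtomicBool::new(false));
        let thread_flag = Arc::clone(&flag);
        let handle = std::thread::Builder::new()
            .name(name.to_string())
            .spawn(move || body(thread_flag))
            .expect("failed to spawn thread");
        THREADS
            .lock()
            .unwrap()
            .insert(id, RegisteredThread { shutdown_requested: flag, handle });
    }

    /// Ask one thread to shut down gracefully and wait for it. Removing the
    /// entry here also avoids leaking JoinHandles for exited threads.
    fn shutdown_thread(id: u64) {
        let entry = THREADS.lock().unwrap().remove(&id);
        if let Some(t) = entry {
            t.shutdown_requested.store(true, Ordering::Relaxed);
            let _ = t.handle.join();
        }
    }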
Log the error and continue. Hopefully it's a transient failure.
This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
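The pattern, roughly (a simplified sketch, not the actual listener code):

    // Sketch of a "log and continue" accept loop: a failure to accept or
    // to spawn a handler is logged instead of taking down the listener.
    use std::net::{TcpListener, TcpStream};

    fn listener_loop(listener: TcpListener) {
        loop {
            match listener.accept() {
                Ok((socket, peer)) => {
                    if let Err(e) = std::thread::Builder::new()
                        .name(format!("serving {peer}"))
                        .spawn(move || handle_connection(socket))
                    {
                        // Could not spawn a thread (e.g. resource
                        // exhaustion): log it and keep accepting.
                        eprintln!("failed to spawn connection thread: {e}");
                    }
                }
                Err(e) => eprintln!("accept() failed: {e}"),
            }
        }
    }

    fn handle_connection(_socket: TcpStream) {
        // connection handling elided
    }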
The 'anyhow' crate can include a backtrace in all errors when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
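For example (a minimal sketch; assumes anyhow = { version = "1",
features = ["backtrace"] } in Cargo.toml):

    use anyhow::{Context, Result};

    fn read_config() -> Result<String> {
        std::fs::read_to_string("config.toml").context("failed to read config file")
    }

    fn main() {
        if let Err(err) = read_config() {
            // '{}' and '{:#}' print only the error message chain; '{:?}'
            // also prints the captured backtrace when one is available
            // (run with RUST_BACKTRACE=1).
            eprintln!("{:?}", err);
        }
    }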
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.
The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.
In passing, reorder some function arguments to be more consistent.
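The key change, schematically (a sketch with placeholder types;
ZTenantId/ZTimelineId stand in for the real ID types, and the map name is
made up):

    use std::collections::HashMap;

    // Placeholder ID types for illustration.
    type ZTenantId = u128;
    type ZTimelineId = u128;
    struct Timeline; // details elided

    // Keyed by (tenant, timeline): timeline IDs alone may collide across
    // tenants, so the pair is the unique key.
    type TimelineMap = HashMap<(ZTenantId, ZTimelineId), Timeline>;

    fn get_timeline(
        map: &TimelineMap,
        tenant: ZTenantId,
        timeline: ZTimelineId,
    ) -> Option<&Timeline> {
        map.get(&(tenant, timeline))
    }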
The walkeeper launches two threads for each connection, and uses a guard
object to remove the entry from the 'replicas' array when it finishes.
But only the background thread held onto the guard object, so if the
background thread finished before the other thread, the array entry would
be removed prematurely, which led to a panic in the check_stop_streaming()
call.
Fixes https://github.com/zenithdb/zenith/issues/1103
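The fix follows the shared-guard pattern: each thread holds a clone of an
Arc-wrapped guard, so the entry is removed only when the last clone is
dropped. A hypothetical sketch (names are illustrative):

    use std::sync::{Arc, Mutex};

    struct Replica; // details elided

    // Dropped only when the last Arc clone is dropped, i.e. when both
    // threads serving the connection have finished.
    struct ReplicaGuard {
        replicas: Arc<Mutex<Vec<Option<Replica>>>>,
        index: usize,
    }

    impl Drop for ReplicaGuard {
        fn drop(&mut self) {
            self.replicas.lock().unwrap()[self.index] = None;
        }
    }

    fn serve_connection(replicas: Arc<Mutex<Vec<Option<Replica>>>>, index: usize) {
        let guard = Arc::new(ReplicaGuard { replicas, index });
        let bg_guard = Arc::clone(&guard);
        let bg = std::thread::spawn(move || {
            let _guard = bg_guard; // background thread's share of the guard
            // background work elided
        });
        // foreground work elided; its share of the guard is dropped here
        drop(guard);
        let _ = bg.join();
    }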
Hexalize zids there for better output. Since Serde doesn't support several
formats for one struct, the on-disk representation changes as well; make
upgrade.rs cope with it.
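The underlying constraint: a struct has a single Serialize impl, so
switching it to hex changes every output, including the on-disk format. A
hypothetical sketch (using the hex crate; ZId is a stand-in type):

    use serde::{Serialize, Serializer};

    // Stand-in for a zid: a 16-byte identifier.
    struct ZId([u8; 16]);

    impl Serialize for ZId {
        fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
            // One format for all serializers: human-readable output and
            // the on-disk representation both become the hex string.
            serializer.serialize_str(&hex::encode(self.0))
        }
    }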
to avoid a subtle race condition.
Without the safekeeper, walreceiver reconnection can get stuck
because of an IO deadlock between walsender auth and a regular backend.
* Do not hold timelines lock during GC
Refs #1087
* Add gc_cs mutex for preventing creation of new timelines during GC
* Make clippy happy
* Use Mutex<()> instead of Mutex<i32> for the GC critical section
  (sketched below)
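A rough sketch of the pattern (hypothetical names, not the actual
repository code):

    use std::sync::Mutex;

    struct Repository {
        // Held for the duration of GC purely as a critical-section token:
        // Mutex<()> guards no data, it only serializes GC against
        // timeline creation.
        gc_cs: Mutex<()>,
    }

    impl Repository {
        fn gc_iteration(&self) {
            let _gc_guard = self.gc_cs.lock().unwrap();
            // GC runs here without holding the timelines lock itself;
            // create_timeline() takes gc_cs too, so it blocks until
            // _gc_guard is dropped.
        }

        fn create_timeline(&self) {
            let _gc_guard = self.gc_cs.lock().unwrap();
            // safe to create a new timeline: no GC in progress
        }
    }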