Files
neon/control_plane
Heikki Linnakangas faf1d20e6a Don't remove PID file in neon_local, and wait after "pageserver init". (#2983)
Our shutdown procedure for "pageserver init" was buggy. Firstly, it
merely sent the process a SIGKILL, but did not wait for it to actually
exit. Normally, it should exit quickly as SIGKILL cannot be caught or
ignored by the target process, but it's still asynchronous and the
process can still be alive when the kill(2) call returns. Secondly,
"neon_local" removed the PID file after sending SIGKILL, even though the
process was still running. That hid the first problem: if we didn't
remove the PID file, and you start a new pageserver process while the
old one is still running, you would get an error when the new process
tries to lock the PID file.

We've been seeing a lot of "Cannot assign requested address" failures in
the CI lately. Our theory is that when we run "pageserver init"
immediately followed by "pageserver start", the first process is still
running and listening on the port when the second invocation starts up.
This commit hopefully fixes the problem.

It is generally a bad idea for the "neon_local" to remove the PID file
on the child process's behalf. The correct way would be for the server
process to remove the PID file, after it has fully shutdown everything
else. We don't currently have a robust way to ensure that everything has
truly shut down and closed, however.

A simpler way is to simply never remove the PID file. It's not necessary
to remove the PID file for correctness: we cannot rely on the cleanup to
happen anyway, if the server process crashes for example. Because of
that, we already have all the logic in place to deal with a stale PID
file that belonged to a process that already exited. Let's rely on that
on normal shutdown too.
2022-12-01 16:38:52 +02:00
..
2022-09-14 08:14:05 +03:00