mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-09 14:32:57 +00:00
Our shutdown procedure for "pageserver init" was buggy. Firstly, it merely sent the process a SIGKILL but did not wait for it to actually exit. Normally the process exits quickly, because SIGKILL cannot be caught or ignored by the target process, but termination is still asynchronous and the process can still be alive when the kill(2) call returns. Secondly, "neon_local" removed the PID file after sending SIGKILL, even though the process might still be running. That hid the first problem: if we had not removed the PID file, and a new pageserver process had been started while the old one was still running, the new process would have failed with an error when it tried to lock the PID file.

We've been seeing a lot of "Cannot assign requested address" failures in the CI lately. Our theory is that when we run "pageserver init" immediately followed by "pageserver start", the first process is still running and listening on the port when the second invocation starts up. This commit hopefully fixes the problem.

It is generally a bad idea for "neon_local" to remove the PID file on the child process's behalf. The correct way would be for the server process to remove the PID file itself, after it has fully shut down everything else. However, we don't currently have a robust way to ensure that everything has truly shut down and closed.

A simpler way is to never remove the PID file at all. Removing it is not necessary for correctness: we cannot rely on the cleanup happening anyway, for example if the server process crashes. Because of that, we already have all the logic in place to deal with a stale PID file that belonged to a process that has already exited. Let's rely on that logic on normal shutdown too.
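
For illustration, the "send SIGKILL, then wait for the process to really exit" pattern can be sketched as follows. This is a minimal POSIX-only sketch in Python, not the actual neon_local code; the function names kill_and_wait and process_exists, the timeout, and the polling interval are all hypothetical choices made for the example.

```python
import os
import signal
import time


def process_exists(pid: int) -> bool:
    """Return True if a process with this PID still exists.

    Signal 0 performs only the existence and permission checks;
    nothing is actually delivered to the target.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def kill_and_wait(pid: int, timeout: float = 10.0, poll_interval: float = 0.05) -> bool:
    """Send SIGKILL to pid and block until the process is gone.

    Returns True once the process no longer exists, False on timeout.
    """
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return True  # already gone before we signalled it

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # If the target happens to be our own child, reap it; otherwise it
        # would linger as a zombie and still appear to exist.
        try:
            os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            pass  # not our child, nothing to reap
        if not process_exists(pid):
            return True
        time.sleep(poll_interval)
    return False
```

The point of the loop is that kill(2) returning successfully only means the signal was queued; the caller has to observe the process disappearing before assuming that its port and PID file lock have been released.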