mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-11 07:22:55 +00:00
Before this PR, the following race condition existed: ``` T1: does the apply_wal_records() call and gets back an error T2: does the apply_wal_records() call and gets back an error T2: does the kill_and_shutdown T2: new loop iteration T2: launches new walredo process T1: does the kill_and_shutdown of the new process ``` That last step is wrong, T2 already did the kill_and_shutdown. The symptom of this race condition was that T2 would observe an error when it tried to do something with the process after T1 killed it. For example, but not limited to: `POLLHUP` / `"WAL redo process closed its stderr unexpectedly"`. The fix in this PR is the following: * Use Arc to represent walredo processes. The Arc lives at least as long as the walredo process. * Use Arc::ptr_eq to determine whether to kill the process or not. The price is an additional RwLock to protect the new `redo_process` field that holds the Arc. I guess that could perhaps be an atomic pointer swap some day. But, let's get one race fixed without risking introducing a new one. The use of Arc/drop is also not super great here because it now allows for an unlimited number of to-be-killed processes to exist concurrently. See the various `NB` comments above `drop(proc)` for why it's "ok" right now due to the blocking `wait` inside `drop`. Note: an earlier fix attempt was https://github.com/neondatabase/neon/pull/5545 where we apply_batch_postgres would compare stdout_fd for equality. That's incorrect because the kernel can reuse the file descriptor when T2 launches the new process. Details: https://github.com/neondatabase/neon/pull/5545#pullrequestreview-1676589373