We had a hot debate on whether we should try to make our code cancellation-safe, or just accept that it's not, and make sure that our Futures are driven to completion. The decision is that we drive Futures to completion. This documents the decision, and summarizes the reasoning for that. Discussion that sparked this: https://github.com/neondatabase/neon/pull/4198#discussion_r1190209316

## Thread management

The pageserver uses Tokio for handling concurrency. Everything runs in
Tokio tasks, although some parts are written in blocking style and use
`spawn_blocking()`.

However, we currently use std blocking functions for disk I/O. The
current model is that we consider disk I/Os to be short enough that we
can perform them while running in a Tokio task. Changing all the disk I/O
calls to async is a TODO.

Each Tokio task is tracked by the `task_mgr` module. It maintains a
registry of tasks, and records which tenant or timeline each task is
operating on.

### Handling shutdown

When a tenant or timeline is deleted, we need to shut down all tasks
operating on it before deleting the data on disk. The `shutdown_tasks`
function requests all tasks of a particular tenant or timeline to shut
down, and then waits for them to finish.

A task registered in the task registry can check whether it has been
requested to shut down by calling `is_shutdown_requested()`. There's
also a `shutdown_watcher()` Future that can be used with `tokio::select!`
or similar, to wake up on shutdown.

### Async cancellation safety

In async Rust, futures can be "cancelled" at any await point by
dropping the Future. For example, `tokio::select!` returns as soon as
one of the Futures returns, and drops the others. `tokio::time::timeout`
is another example. In the Rust ecosystem, some functions are
cancellation-safe, meaning their Futures can be dropped before
completion without side effects, while others are not. See the
documentation of `tokio::select!` for examples.
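
To make "cancelled by dropping" concrete, here is a minimal sketch that uses only the standard library. It polls a future by hand instead of using Tokio, and `YieldOnce`, `NoopWake` and `poll_once_then_drop` are hypothetical names invented for this illustration. The future is polled once, suspends at its `.await`, and is then dropped, so the code after the `.await` never runs.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical helper: returns `Pending` on the first poll, `Ready` on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// A waker that does nothing; sufficient because we poll by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a two-step future once, then drops it. Returns `(started, finished)`.
fn poll_once_then_drop() -> (bool, bool) {
    let started = Rc::new(Cell::new(false));
    let finished = Rc::new(Cell::new(false));
    let (s, f) = (started.clone(), finished.clone());

    let mut fut = Box::pin(async move {
        s.set(true);
        YieldOnce(false).await; // the future is suspended here when it is dropped
        f.set(true); // never reached in this demo
    });

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // this is the "cancellation"

    (started.get(), finished.get())
}

fn main() {
    let (started, finished) = poll_once_then_drop();
    println!("started = {started}, finished = {finished}");
}
```

Dropping the future inside `tokio::select!` or a timeout has the same effect as the explicit `drop(fut)` here: the first half of the operation has happened, the second half has not.
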

In the pageserver and safekeeper, async code is *not*
cancellation-safe by default. Unless otherwise marked, any async
function that you call cannot be assumed to be cancellation-safe,
and must be polled to completion.

The downside of non-cancellation-safe code is that you have to be very
careful when using `tokio::select!`, `tokio::time::timeout`, and other
such constructs that can cause a Future to be dropped. They can only be
used with functions that are explicitly documented to be
cancellation-safe; otherwise you need to spawn a separate task to shield
the code from the cancellation.

At the entry points to the code, we also take care to poll futures to
completion, or shield the rest of the code from surprise cancellations
by spawning a separate task. The code that handles incoming HTTP
requests, for example, spawns a separate task for each request,
because Hyper will drop the request-handling Future if the HTTP
connection is lost. (FIXME: our HTTP handlers do not do that
currently, but we should fix that. See [issue
3478](https://github.com/neondatabase/neon/issues/3478)).

#### How to cancel, then?

If our code is not cancellation-safe, how do you cancel long-running
tasks? Use CancellationTokens.

TODO: More details on that. We also have an ongoing discussion on what
to do if cancellations might come from multiple sources.
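
Until those details are written down, the general idea of token-based cancellation can be sketched as follows. This is not the pageserver's implementation. `CancelToken`, `drain` and `demo` are hypothetical, std-only stand-ins for a real token type such as `tokio_util::sync::CancellationToken`. The point is only that the worker checks the token between units of work, so each unit runs to completion and nothing is dropped half-way.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical minimal cancellation token: a shared flag that can be set once.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Processes items until the inbox is empty or cancellation is requested.
/// The token is only checked *between* items, never in the middle of one.
fn drain(inbox: &mut VecDeque<u32>, token: &CancelToken, mut process: impl FnMut(u32)) {
    while !token.is_cancelled() {
        let Some(item) = inbox.pop_front() else { break };
        process(item);
    }
}

/// Simulates a cancellation request that arrives while item 2 is being processed.
/// Returns `(processed, left_in_inbox)`.
fn demo() -> (Vec<u32>, Vec<u32>) {
    let token = CancelToken::default();
    let requester = token.clone();
    let mut inbox = VecDeque::from([1, 2, 3]);
    let mut outbox = Vec::new();

    drain(&mut inbox, &token, |item| {
        outbox.push(item);
        if item == 2 {
            requester.cancel();
        }
    });

    (outbox, inbox.into_iter().collect())
}

fn main() {
    let (processed, remaining) = demo();
    println!("processed = {processed:?}, remaining = {remaining:?}");
}
```

Item 2 is fully processed even though cancellation was requested while it was in flight, and item 3 is still in the inbox rather than lost.
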

#### Exceptions

Some library functions are cancellation-safe, and are explicitly marked
as such. For example, `utils::seqwait`.

#### Rationale

The alternative would be to make all async code cancellation-safe,
unless otherwise marked. That way, you could use `tokio::select!` more
liberally. The reasons we didn't choose that are explained in this
section.

Writing code in a cancellation-safe manner is tedious, as you need to
scrutinize every `.await` and ensure that if the `.await` call never
returns, the system is in a safe, consistent state. In some ways, you
need to do that with `?` and early `return`s, too, but `.await`s are
easier to miss. It is also easier to perform cleanup tasks when a
function returns an `Err` than when an `.await` simply never
returns. You can use `scopeguard` and Drop guards to perform cleanup
tasks, but it is more tedious. An `.await` that never returns is more
similar to a panic.
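
As an illustration of the Drop-guard approach mentioned above, here is a std-only sketch. It polls the future by hand rather than using Tokio, and `InProgressGuard`, `YieldOnce`, `NoopWake` and `stale_after_cancel` are hypothetical names invented for this example. An operation sets an "in progress" flag and is cancelled at its `.await`. Without a guard, the flag is left set. With a guard, `Drop` clears it.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical helper: returns `Pending` on the first poll, `Ready` on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// A waker that does nothing; sufficient because we poll by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Clears the flag when dropped, whether by normal return or by cancellation.
struct InProgressGuard(Rc<Cell<bool>>);

impl Drop for InProgressGuard {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

/// Cancels an operation at its `.await` and reports whether the
/// "in progress" flag was left set afterwards.
fn stale_after_cancel(use_guard: bool) -> bool {
    let in_progress = Rc::new(Cell::new(false));
    let flag = in_progress.clone();

    let mut fut = Box::pin(async move {
        flag.set(true);
        let _guard = use_guard.then(|| InProgressGuard(flag.clone()));
        YieldOnce(false).await; // cancelled here
        flag.set(false); // explicit cleanup, skipped on cancellation
    });

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // cancel: locals held across the await are dropped

    in_progress.get()
}

fn main() {
    println!("without guard, flag left set: {}", stale_after_cancel(false));
    println!("with guard, flag left set: {}", stale_after_cancel(true));
}
```

The guard works, but every piece of state touched before an `.await` needs one, which is the tedium the paragraph above refers to.
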

Note that even if you only use building blocks that themselves are
cancellation-safe, it doesn't mean that the code as a whole is
cancellation-safe. For example, consider the following code:

```
// The signature is assumed; the channel types here are illustrative.
async fn forward(
    mut work_inbox: tokio::sync::mpsc::Receiver<u32>,
    results_outbox: tokio::sync::mpsc::Sender<u32>,
) {
    while let Some(i) = work_inbox.recv().await {
        // If this future is dropped while waiting here, `i` is lost.
        if let Err(_) = results_outbox.send(i).await {
            println!("receiver dropped");
            return;
        }
    }
}
```

It reads messages from one channel and sends them to another channel. If
this code is cancelled at the `results_outbox.send(i).await`, the
message read from the receiver is lost. That may or may not be OK,
depending on the context.
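
The lost-message scenario can be reproduced without Tokio in a small std-only sketch. The names `Queue`, `YieldOnce`, `NoopWake`, `forward_one` and `demo_lost_message` are hypothetical; the queues stand in for the two channels, and `YieldOnce` stands in for `send(i).await` waiting for capacity. After the future is cancelled at that point, the item is in neither queue.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a channel: a shared single-threaded queue.
type Queue = Rc<RefCell<VecDeque<u32>>>;

/// Hypothetical helper: returns `Pending` on the first poll, `Ready` on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// A waker that does nothing; sufficient because we poll by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Moves one item from `inbox` to `outbox`, with an await point in between.
async fn forward_one(inbox: Queue, outbox: Queue) {
    let item = inbox.borrow_mut().pop_front();
    if let Some(i) = item {
        YieldOnce(false).await; // stands in for `send(i).await` waiting for capacity
        outbox.borrow_mut().push_back(i);
    }
}

/// Cancels `forward_one` at its await point.
/// Returns `(items left in inbox, items in outbox)`.
fn demo_lost_message() -> (usize, usize) {
    let inbox: Queue = Rc::new(RefCell::new(VecDeque::from([42])));
    let outbox: Queue = Rc::new(RefCell::new(VecDeque::new()));

    let mut fut = Box::pin(forward_one(inbox.clone(), outbox.clone()));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // cancel while the item is held only by the future

    let lens = (inbox.borrow().len(), outbox.borrow().len());
    lens
}

fn main() {
    let (in_inbox, in_outbox) = demo_lost_message();
    println!("inbox: {in_inbox}, outbox: {in_outbox}");
}
```

Each step on its own leaves the queues consistent. The loss comes from the item living only in the future's local state across the await point.
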

Another reason to not require cancellation-safety is historical: we
already had a lot of async code that was not scrutinized for
cancellation-safety when this issue was raised. Scrutinizing all
existing code is no fun.