mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-20 06:30:43 +00:00
Before this PR, the `nix::poll::poll` call would stall the executor.
This PR refactors the `walredo::process` module to allow for different
implementations, and adds a new `async` implementation which uses
`tokio::process::ChildStd{in,out}` for IPC.
The `sync` variant remains the default for now; we'll do more testing in
staging and gradual rollout to prod using the config variable.
Performance
-----------
I updated `bench_walredo.rs`, demonstrating that a single `async`-based
walredo manager used by N=1...128 tokio tasks has lower latency and
higher throughput.
I further did manual less-micro-benchmarking in the real pageserver
binary.
Methodology & results are published here:
https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4
tl;dr:
- use pagebench against a pageserver patched to answer getpage request &
small-enough working set to fit into PS PageCache / kernel page cache.
- compare knee in the latency/throughput curve
- N tenants, each 1 pagebench clients
- sync better throughput at N < 30, async better at higher N
- async generally noticable but not much worse p99.X tail latencies
- eyeballing CPU efficiency in htop, `async` seems significantly more
CPU efficient at ca N=[0.5*ncpus, 1.5*ncpus], worse than `sync` outside
of that band
Mental Model For Walredo & Scheduler Interactions
-------------------------------------------------
Walredo is CPU-/DRAM-only work.
This means that as soon as the Pageserver writes to the pipe, the
walredo process becomes runnable.
To the Linux kernel scheduler, the `$ncpus` executor threads and the
walredo process thread are just `struct task_struct`, and it will divide
CPU time fairly among them.
In `sync` mode, there are always `$ncpus` runnable `struct task_struct`
because the executor thread blocks while `walredo` runs, and the
executor thread becomes runnable when the `walredo` process is done
handling the request.
In `async` mode, the executor threads remain runnable unless there are
no more runnable tokio tasks, which is unlikely in a production
pageserver.
The above means that in `sync` mode, there is an implicit concurrency
limit on concurrent walredo requests (`$num_runtimes *
$num_executor_threads_per_runtime`).
And executor threads do not compete in the Linux kernel scheduler for
CPU time, due to the blocked-runnable-ping-pong.
In `async` mode, there is no concurrency limit, and the walredo tasks
compete with the executor threads for CPU time in the kernel scheduler.
If we're not CPU-bound, `async` has a pipelining and hence throughput
advantage over `sync` because one executor thread can continue
processing requests while a walredo request is in flight.
If we're CPU-bound, under a fair CPU scheduler, the *fixed* number of
executor threads has to share CPU time with the aggregate of walredo
processes.
It's trivial to reason about this in `sync` mode due to the
blocked-runnable-ping-pong.
In `async` mode, at 100% CPU, the system arrives at some (potentially
sub-optiomal) equilibrium where the executor threads get just enough CPU
time to fill up the remaining CPU time with runnable walredo process.
Why `async` mode Doesn't Limit Walredo Concurrency
--------------------------------------------------
To control that equilibrium in `async` mode, one may add a tokio
semaphore to limit the number of in-flight walredo requests.
However, the placement of such a semaphore is non-trivial because it
means that tasks queuing up behind it hold on to their request-scoped
allocations.
In the case of walredo, that might be the entire reconstruct data.
We don't limit the number of total inflight Timeline::get (we only
throttle admission).
So, that queue might lead to an OOM.
The alternative is to acquire the semaphore permit *before* collecting
reconstruct data.
However, what if we need to on-demand download?
A combination of semaphores might help: one for reconstruct data, one
for walredo.
The reconstruct data semaphore permit is dropped after acquiring the
walredo semaphore permit.
This scheme effectively enables both a limit on in-flight reconstruct
data and walredo concurrency.
However, sizing the amount of permits for the semaphores is tricky:
- Reconstruct data retrieval is a mix of disk IO and CPU work.
- If we need to do on-demand downloads, it's network IO + disk IO + CPU
work.
- At this time, we have no good data on how the wall clock time is
distributed.
It turns out that, in my benchmarking, the system worked fine without a
semaphore. So, we're shipping async walredo without one for now.
Future Work
-----------
We will do more testing of `async` mode and gradual rollout to prod
using the config flag.
Once that is done, we'll remove `sync` mode to avoid the temporary code
duplication introduced by this PR.
The flag will be removed.
The `wait()` for the child process to exit is still synchronous; the
comment [here](
655d3b6468/pageserver/src/walredo.rs (L294-L306))
is still a valid argument in favor of that.
The `sync` mode had another implicit advantage: from tokio's
perspective, the calling task was using up coop budget.
But with `async` mode, that's no longer the case -- to tokio, the writes
to the child process pipe look like IO.
We could/should inform tokio about the CPU time budget consumed by the
task to achieve fairness similar to `sync`.
However, the [runtime function for this is
`tokio_unstable`](`https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html).
Refs
----
refs #6628
refs https://github.com/neondatabase/neon/issues/2975
98 lines
2.4 KiB
Rust
98 lines
2.4 KiB
Rust
use std::time::Duration;
|
|
|
|
use bytes::Bytes;
|
|
use pageserver_api::{reltag::RelTag, shard::TenantShardId};
|
|
use utils::lsn::Lsn;
|
|
|
|
use crate::{config::PageServerConf, walrecord::NeonWalRecord};
|
|
|
|
mod no_leak_child;
|
|
/// The IPC protocol that pageserver and walredo process speak over their shared pipe.
|
|
mod protocol;
|
|
|
|
mod process_impl {
|
|
pub(super) mod process_async;
|
|
pub(super) mod process_std;
|
|
}
|
|
|
|
#[derive(
|
|
Clone,
|
|
Copy,
|
|
Debug,
|
|
PartialEq,
|
|
Eq,
|
|
strum_macros::EnumString,
|
|
strum_macros::Display,
|
|
strum_macros::IntoStaticStr,
|
|
serde_with::DeserializeFromStr,
|
|
serde_with::SerializeDisplay,
|
|
)]
|
|
#[strum(serialize_all = "kebab-case")]
|
|
#[repr(u8)]
|
|
pub enum Kind {
|
|
Sync,
|
|
Async,
|
|
}
|
|
|
|
pub(crate) enum Process {
|
|
Sync(process_impl::process_std::WalRedoProcess),
|
|
Async(process_impl::process_async::WalRedoProcess),
|
|
}
|
|
|
|
impl Process {
|
|
#[inline(always)]
|
|
pub fn launch(
|
|
conf: &'static PageServerConf,
|
|
tenant_shard_id: TenantShardId,
|
|
pg_version: u32,
|
|
) -> anyhow::Result<Self> {
|
|
Ok(match conf.walredo_process_kind {
|
|
Kind::Sync => Self::Sync(process_impl::process_std::WalRedoProcess::launch(
|
|
conf,
|
|
tenant_shard_id,
|
|
pg_version,
|
|
)?),
|
|
Kind::Async => Self::Async(process_impl::process_async::WalRedoProcess::launch(
|
|
conf,
|
|
tenant_shard_id,
|
|
pg_version,
|
|
)?),
|
|
})
|
|
}
|
|
|
|
#[inline(always)]
|
|
pub(crate) async fn apply_wal_records(
|
|
&self,
|
|
rel: RelTag,
|
|
blknum: u32,
|
|
base_img: &Option<Bytes>,
|
|
records: &[(Lsn, NeonWalRecord)],
|
|
wal_redo_timeout: Duration,
|
|
) -> anyhow::Result<Bytes> {
|
|
match self {
|
|
Process::Sync(p) => {
|
|
p.apply_wal_records(rel, blknum, base_img, records, wal_redo_timeout)
|
|
.await
|
|
}
|
|
Process::Async(p) => {
|
|
p.apply_wal_records(rel, blknum, base_img, records, wal_redo_timeout)
|
|
.await
|
|
}
|
|
}
|
|
}
|
|
|
|
pub(crate) fn id(&self) -> u32 {
|
|
match self {
|
|
Process::Sync(p) => p.id(),
|
|
Process::Async(p) => p.id(),
|
|
}
|
|
}
|
|
|
|
pub(crate) fn kind(&self) -> Kind {
|
|
match self {
|
|
Process::Sync(_) => Kind::Sync,
|
|
Process::Async(_) => Kind::Async,
|
|
}
|
|
}
|
|
}
|