Before this PR, the `nix::poll::poll` call would stall the executor.
This PR refactors the `walredo::process` module to allow for different
implementations, and adds a new `async` implementation which uses
`tokio::process::ChildStd{in,out}` for IPC.
The `sync` variant remains the default for now; we'll do more testing in
staging and gradual rollout to prod using the config variable.
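For orientation, here is a minimal sketch of the `async` IPC pattern (the binary name, request framing, and page size are made up for illustration; this is not the actual pageserver code):

```rust
use std::process::Stdio;

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::Command;

async fn walredo_roundtrip() -> std::io::Result<Vec<u8>> {
    // Spawn the walredo child process with piped stdin/stdout.
    let mut child = Command::new("walredo") // hypothetical binary name
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = child.stdin.take().unwrap();
    let mut stdout = child.stdout.take().unwrap();

    // Unlike the `sync` variant, this write does not block the executor
    // thread: tokio parks the task and the thread keeps running other tasks.
    stdin.write_all(b"<wal records>").await?;

    // Await the reconstructed page from the child's stdout.
    let mut page = vec![0u8; 8192];
    stdout.read_exact(&mut page).await?;
    Ok(page)
}
```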
Performance
-----------
I updated `bench_walredo.rs`, demonstrating that a single `async`-based
walredo manager used by N=1...128 tokio tasks has lower latency and
higher throughput.
I further did manual less-micro-benchmarking in the real pageserver
binary.
Methodology & results are published here:
https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4
tl;dr:
- use pagebench against a pageserver patched to answer getpage requests;
the working set is small enough to fit into the PS PageCache / kernel
page cache
- compare the knee in the latency/throughput curve
- N tenants, with 1 pagebench client each
- `sync` has better throughput at N < 30, `async` is better at higher N
- `async` generally has noticeably, but not much, worse p99.X tail
latencies
- eyeballing CPU efficiency in htop, `async` seems significantly more
CPU efficient at ca. N = [0.5*ncpus, 1.5*ncpus], and worse than `sync`
outside of that band
Mental Model For Walredo & Scheduler Interactions
-------------------------------------------------
Walredo is CPU-/DRAM-only work.
This means that as soon as the Pageserver writes to the pipe, the
walredo process becomes runnable.
To the Linux kernel scheduler, the `$ncpus` executor threads and the
walredo process thread are just `struct task_struct`, and it will divide
CPU time fairly among them.
In `sync` mode, there are always `$ncpus` runnable `struct task_struct`
because the executor thread blocks while `walredo` runs, and the
executor thread becomes runnable when the `walredo` process is done
handling the request.
In `async` mode, the executor threads remain runnable unless there are
no more runnable tokio tasks, which is unlikely in a production
pageserver.
The above means that in `sync` mode, there is an implicit limit on
concurrent walredo requests (`$num_runtimes *
$num_executor_threads_per_runtime`).
Also, executor threads do not compete with walredo processes for CPU
time in the Linux kernel scheduler, due to the blocked-runnable-ping-pong.
In `async` mode, there is no concurrency limit, and the walredo tasks
compete with the executor threads for CPU time in the kernel scheduler.
If we're not CPU-bound, `async` has a pipelining and hence throughput
advantage over `sync` because one executor thread can continue
processing requests while a walredo request is in flight.
If we're CPU-bound, under a fair CPU scheduler, the *fixed* number of
executor threads has to share CPU time with the aggregate of walredo
processes.
It's trivial to reason about this in `sync` mode due to the
blocked-runnable-ping-pong.
In `async` mode, at 100% CPU, the system arrives at some (potentially
sub-optimal) equilibrium where the executor threads get just enough CPU
time to fill the remaining CPU time with runnable walredo processes.
Why `async` mode Doesn't Limit Walredo Concurrency
--------------------------------------------------
To control that equilibrium in `async` mode, one may add a tokio
semaphore to limit the number of in-flight walredo requests.
However, the placement of such a semaphore is non-trivial because it
means that tasks queuing up behind it hold on to their request-scoped
allocations.
In the case of walredo, that might be the entire reconstruct data.
We don't limit the total number of in-flight `Timeline::get` calls (we
only throttle admission), so that queue might lead to an OOM.
The alternative is to acquire the semaphore permit *before* collecting
reconstruct data.
However, what if we need to on-demand download?
A combination of semaphores might help: one for reconstruct data, one
for walredo.
The reconstruct data semaphore permit is dropped after acquiring the
walredo semaphore permit.
This scheme effectively enables both a limit on in-flight reconstruct
data and walredo concurrency.
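A minimal sketch of that hand-over scheme, assuming `tokio::sync::Semaphore` and hypothetical stand-in helpers (`collect_reconstruct_data`, `apply_walredo`):

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;

// Stand-ins for the real pageserver functions.
async fn collect_reconstruct_data() -> Vec<u8> { vec![] }
async fn apply_walredo(_data: Vec<u8>) {}

async fn get_page(reconstruct_sem: Arc<Semaphore>, walredo_sem: Arc<Semaphore>) {
    // Bound how much reconstruct data can be held in memory at once.
    let reconstruct_permit = reconstruct_sem.acquire_owned().await.unwrap();
    let data = collect_reconstruct_data().await; // may on-demand download

    // Hand-over: acquire the walredo permit first, then release the
    // reconstruct-data permit so the next request can be admitted.
    let _walredo_permit = walredo_sem.acquire_owned().await.unwrap();
    drop(reconstruct_permit);

    apply_walredo(data).await;
}
```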
However, sizing the number of permits for the semaphores is tricky:
- Reconstruct data retrieval is a mix of disk IO and CPU work.
- If we need to do on-demand downloads, it's network IO + disk IO + CPU
work.
- At this time, we have no good data on how the wall clock time is
distributed.
It turns out that, in my benchmarking, the system worked fine without a
semaphore. So, we're shipping async walredo without one for now.
Future Work
-----------
We will do more testing of `async` mode and a gradual rollout to prod
using the config flag.
Once that is done, we'll remove `sync` mode, and the flag with it, to
avoid the temporary code duplication introduced by this PR.
The `wait()` for the child process to exit is still synchronous; the
comment [here](655d3b6468/pageserver/src/walredo.rs (L294-L306))
is still a valid argument in favor of that.
The `sync` mode had another implicit advantage: from tokio's
perspective, the calling task was using up coop budget.
But with `async` mode, that's no longer the case -- to tokio, the writes
to the child process pipe look like IO.
We could/should inform tokio about the CPU time budget consumed by the
task to achieve fairness similar to `sync`.
However, the [runtime function for this is
`tokio_unstable`](https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html).
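For illustration, a sketch of what that could look like, assuming a build with `--cfg tokio_unstable` (the `Record` type and `apply_one` are placeholders, not pageserver code):

```rust
struct Record; // placeholder for a WAL record type

fn apply_one(_record: &Record) {
    // CPU-bound redo work; invisible to tokio's coop budget on its own.
}

async fn apply_records(records: &[Record]) {
    for record in records {
        apply_one(record);
        // Charge the CPU work against this task's coop budget; the task
        // yields to the scheduler once the budget is exhausted.
        tokio::task::consume_budget().await;
    }
}
```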
Refs
----
refs #6628
refs https://github.com/neondatabase/neon/issues/2975
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.

#![deny(clippy::undocumented_unsafe_blocks)]

pub mod backoff;

/// `Lsn` type implements common tasks on Log Sequence Numbers
pub mod lsn;
/// SeqWait allows waiting for a future sequence number to arrive
pub mod seqwait;

/// A simple Read-Copy-Update implementation.
pub mod simple_rcu;

/// append only ordered map implemented with a Vec
pub mod vec_map;

pub mod bin_ser;

// helper functions for creating and fsyncing
pub mod crashsafe;

// common authentication routines
pub mod auth;

// utility functions and helper traits for unified unique id generation/serialization etc.
pub mod id;

mod hex;
pub use hex::Hex;

// http endpoint utils
pub mod http;

// definition of the Generation type for pageserver attachment APIs
pub mod generation;

// common log initialisation routine
pub mod logging;

pub mod lock_file;
pub mod pid_file;

// Misc
pub mod accum;
pub mod shutdown;

// Utility for binding TcpListeners with proper socket options.
pub mod tcp_listener;

// Utility for putting a raw file descriptor into non-blocking mode
pub mod nonblock;

// Default signal handling
pub mod sentry_init;
pub mod signals;

pub mod fs_ext;

pub mod history_buffer;

pub mod measured_stream;

pub mod serde_percent;
pub mod serde_regex;
pub mod serde_system_time;

pub mod pageserver_feedback;

pub mod postgres_client;

pub mod tracing_span_assert;

pub mod rate_limit;

/// Simple once-barrier and a guard which keeps barrier awaiting.
pub mod completion;

/// Reporting utilities
pub mod error;

/// async timeout helper
pub mod timeout;

pub mod sync;

pub mod failpoint_support;

pub mod yielding_loop;

pub mod zstd;

pub mod env;

pub mod poison;

/// This is a shortcut to embed the git sha into binaries and avoid copying the same build script to all packages
///
/// We have several cases:
/// * building locally from a git repo
/// * building in CI from a git repo
/// * building in docker (either in CI or locally)
///
/// One thing to note is that .git is not available in docker (and it is bad to include it there).
/// When building locally, `git_version` is used to query .git. When building in CI or in docker,
/// we don't build the actual PR branch commits, but always a "phantom" would-be merge commit into
/// the target branch -- the actual PR commit we build from is supplied via the GIT_VERSION
/// environment variable.
///
/// We ended up with this compromise between phantom would-be merge commits vs. pull request branch
/// heads in #4641, because it makes old logs more reliable (github could gc the phantom merge
/// commit anytime).
///
/// To avoid running the build script on every recompilation, we use the rerun-if-env-changed option,
/// so the build script is only re-run when the GIT_VERSION env var has changed.
///
/// Why not use a build script to get the git commit sha directly, without a proc macro from a
/// different crate? Caching and workspaces complicate that: if `utils` is not recompiled due to
/// caching, the version may become outdated. The git_version crate handles that case by introducing
/// a dependency on .git internals via the include_bytes! macro, so if the index state changes,
/// git_version will pick that up and rerun the macro.
///
/// Note that the prefix is `git:` for versions from git_version, and `git-env:` for versions from
/// the environment.
///
/// #############################################################################################
/// TODO: this macro is not the way the library is intended to be used, see <https://github.com/neondatabase/neon/issues/1565> for details.
/// We use `cachepot` to reduce our current CI build times: <https://github.com/neondatabase/cloud/pull/1033#issuecomment-1100935036>
/// Yet, it seems to ignore the GIT_VERSION env variable passed to the Docker build, even with a build.rs that contains
/// `println!("cargo:rerun-if-env-changed=GIT_VERSION");` for cachepot cache invalidation.
/// The problem needs further investigation; a regular `const` declaration instead of a macro would be preferable.
#[macro_export]
macro_rules! project_git_version {
    ($const_identifier:ident) => {
        // this should try GIT_VERSION first, and only then git_version::git_version!
        const $const_identifier: &::core::primitive::str = {
            const __COMMIT_FROM_GIT: &::core::primitive::str = git_version::git_version! {
                prefix = "",
                fallback = "unknown",
                args = ["--abbrev=40", "--always", "--dirty=-modified"] // always use full sha
            };

            const __ARG: &[&::core::primitive::str; 2] = &match ::core::option_env!("GIT_VERSION") {
                ::core::option::Option::Some(x) => ["git-env:", x],
                ::core::option::Option::None => ["git:", __COMMIT_FROM_GIT],
            };

            $crate::__const_format::concatcp!(__ARG[0], __ARG[1])
        };
    };
}

/// This is a shortcut to embed the build tag into binaries and avoid copying the same build script to all packages
#[macro_export]
macro_rules! project_build_tag {
    ($const_identifier:ident) => {
        const $const_identifier: &::core::primitive::str = {
            const __ARG: &[&::core::primitive::str; 2] = &match ::core::option_env!("BUILD_TAG") {
                ::core::option::Option::Some(x) => ["build_tag-env:", x],
                ::core::option::Option::None => ["build_tag:", ""],
            };

            $crate::__const_format::concatcp!(__ARG[0], __ARG[1])
        };
    };
}

/// Re-export for `project_git_version` macro
#[doc(hidden)]
pub use const_format as __const_format;

/// Same as `assert!`, but evaluated during compilation and optimized out at runtime.
#[macro_export]
macro_rules! const_assert {
    ($($args:tt)*) => {
        const _: () = assert!($($args)*);
    };
}
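// Illustrative usage of the macros above (hypothetical call sites, not part
// of this file):
//
//     utils::project_git_version!(GIT_VERSION);
//     utils::project_build_tag!(BUILD_TAG);
//     utils::const_assert!(!GIT_VERSION.is_empty());
//
//     fn main() {
//         println!("version: {GIT_VERSION}, build tag: {BUILD_TAG}");
//     }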