mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-14 08:52:56 +00:00
## Refs

- Epic: https://github.com/neondatabase/neon/issues/9378

Co-authored-by: Vlad Lazar <vlad@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>

## Problem

The read path does its IOs sequentially. This means that if N values need to be read to reconstruct a page, we will do N IOs, and getpage latency is `O(N * IoLatency)`.

## Solution

With this PR we gain the ability to issue IO concurrently within one layer visit **and** to move on to the next layer without waiting for IOs from the previous visit to complete. This is an evolved version of the work done at the Lisbon hackathon, cf https://github.com/neondatabase/neon/pull/9002.

## Design

### `will_init` now sourced from disk btree index keys

On the algorithmic level, the only change is that `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes.

### Concurrent IOs, Submission & Completion

To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot receivers take the place of the `img` and `records` fields inside `VectoredValueReconstructState`.

When we're done visiting all the layers and submitting all the IOs along the way, we concurrently `collect_pending_ios` for each value: for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That task receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturesUnordered`.

(The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.)

#### Alternatives Explored

A few words on the rationale behind having a sidecar *task* and what alternatives were considered.

One option is to queue up all IO futures in a FuturesUnordered that is polled for the first time when we `collect_pending_ios`. This falls short for two reasons. First, the IO futures are opaque, compiler-generated futures that need to be polled at least once to even submit their IO; "at least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on the first poll right away. Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some _and_ try to acquire _more_ shared, limited resources. For example, both the spawning task and an IO future may try to acquire

* a VirtualFile file descriptor cache slot async mutex (observed during impl)
* a tokio-epoll-uring submission slot (observed during impl)
* a PageCache slot (currently this is not the case, but we may move more code into the IO futures in the future)

Another option is to spawn a short-lived `tokio::task` for each IO future. We implemented and benchmarked it during development, but found little throughput improvement and moderate mean & tail latency degradation. Concerns about pressure on the tokio scheduler made us discard this variant.

The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However,

1. the opaque-futures approach taken in this PR allows leaving the existing code unchanged, which
2. allows us to implement the `IoConcurrency::Sequential` mode for feature-gating the change.

Once the sidecar-task mode is rolled out everywhere and `::Sequential` is removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate the VirtualFile file descriptor cache and the tokio-epoll-uring slots. The latter has been drafted in https://github.com/neondatabase/tokio-epoll-uring/pull/63.

See the lengthy doc comment on `spawn_io()` for more details. A minimal sketch of the spawn/sidecar split is shown below.
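The following is a minimal, hypothetical sketch of the submission/completion split described above: an `IoConcurrency` handle that either awaits IOs in place (`Sequential`) or ships them over an unbounded mpsc channel to a sidecar task that drives them in a `FuturesUnordered`. The real implementation (gate guards, metrics, error propagation, the actual `spawn_io` signature) differs; variant and helper names here are illustrative only.

```rust
// Hypothetical, simplified sketch; not the actual pageserver code.
use futures::stream::{FuturesUnordered, StreamExt};
use std::future::Future;
use std::pin::Pin;
use tokio::sync::mpsc;

type IoFuture = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;

enum IoConcurrency {
    /// Pre-existing behavior: await each IO in place.
    Sequential,
    /// New behavior: ship the IO future to a sidecar task that drives it.
    SidecarTask { ios_tx: mpsc::UnboundedSender<IoFuture> },
}

impl IoConcurrency {
    fn spawn_sidecar() -> Self {
        let (ios_tx, mut ios_rx) = mpsc::unbounded_channel::<IoFuture>();
        tokio::spawn(async move {
            let mut in_flight = FuturesUnordered::new();
            loop {
                tokio::select! {
                    // New IO futures arrive from the spawning task.
                    maybe_io = ios_rx.recv() => match maybe_io {
                        Some(io) => in_flight.push(io),
                        None => break, // all senders dropped
                    },
                    // Drive already-submitted IOs to completion concurrently.
                    Some(()) = in_flight.next(), if !in_flight.is_empty() => {}
                }
            }
            // Drain whatever is still in flight before exiting.
            while in_flight.next().await.is_some() {}
        });
        IoConcurrency::SidecarTask { ios_tx }
    }

    /// "Spawn" an opaque IO future. Completion is observed by the caller
    /// through whatever channel the future writes to (e.g. a oneshot).
    async fn spawn_io<F>(&self, fut: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        match self {
            IoConcurrency::Sequential => fut.await,
            IoConcurrency::SidecarTask { ios_tx } => {
                let _ = ios_tx.send(Box::pin(fut));
            }
        }
    }
}
```

The channel is unbounded in this sketch to match the fire-and-forget submission semantics described above; any backpressure would have to come from elsewhere (e.g. limits on in-flight getpage requests).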
### Error handling

There are two error classes during reconstruct data retrieval:

* traversal errors: index lookup, move to next layer, and the like
* value read IO errors

A traversal error fails the entire get_vectored request, as before this PR. A value read error only fails that value.

In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale.

### Feature Gating

The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s.

## Code-Level Changes

Prep work:

* Make `GateGuard` clonable.

Core Feature:

* Traversal code: track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop.
* Introduce `IoConcurrency` & its sidecar task.
  * `IoConcurrency` is the clonable handle.
  * It connects to the sidecar task via an `mpsc`.
* Plumb through `IoConcurrency` from high-level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this.
  * The sidecar task should be long-lived, so `IoConcurrency` needs to be rooted up "high" in the call stack.
  * Roots as of this PR:
    * `page_service`: outside of pagestream loop
    * `create_image_layers`: when it is called
    * `basebackup` (only auxfiles + replorigin + SLRU segments)
  * Code with no roots that uses `IoConcurrency::sequential`:
    * any `Timeline::get` call
      * `collect_keyspace` is a good example
      * follow-up: https://github.com/neondatabase/neon/issues/10460
    * `TimelineAdaptor` code used by the compaction simulator, unused in practice
    * `ingest_xlog_dbase_create`
* Transform Delta/Image/InMemoryLayer to
  * do their values IO in a distinct `async {}` block
  * extend the residence of the Delta/Image layer until the IO is done
  * buffer their results in a `oneshot` channel instead of straight in `ValuesReconstructState`
    * the `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling
* Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel, aka `OnDiskValueIoWaiter`.
* Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered` (sketched below).
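Below is a minimal, hypothetical sketch of the per-value buffering and concurrent collection described above: each on-disk value read resolves a `oneshot`, and the collection step awaits all waiters for a value and runs reconstruction, with a failed read failing only that value. The names mirror the description (`OnDiskValueIoWaiter`, `collect_pending_ios`), but the types and signatures are assumptions, not the actual pageserver API.

```rust
// Hypothetical, simplified sketch of the oneshot-based collection step.
use futures::stream::{FuturesUnordered, StreamExt};
use std::collections::HashMap;
use tokio::sync::oneshot;

type Key = u64; // stand-in for the real pageserver key type
type OnDiskValue = Vec<u8>; // stand-in for an image or WAL record

/// Receiving end of one value read IO, as buffered in the reconstruct state.
struct OnDiskValueIoWaiter {
    rx: oneshot::Receiver<Result<OnDiskValue, std::io::Error>>,
}

struct ValueReconstructState {
    waiters: Vec<OnDiskValueIoWaiter>,
}

/// Await all IOs for all values and run "walredo" per value, concurrently.
/// A failed read fails only that value; the others still reconstruct.
async fn collect_pending_ios(
    values: HashMap<Key, ValueReconstructState>,
) -> HashMap<Key, Result<Vec<u8>, String>> {
    let mut futs = FuturesUnordered::new();
    for (key, state) in values {
        futs.push(async move {
            let mut pieces = Vec::new();
            for waiter in state.waiters {
                match waiter.rx.await {
                    Ok(Ok(value)) => pieces.push(value),
                    Ok(Err(e)) => return (key, Err(format!("read failed: {e}"))),
                    Err(_) => return (key, Err("IO dropped without completing".into())),
                }
            }
            // Stand-in for walredo: concatenate the pieces into a "page image".
            (key, Ok(pieces.concat()))
        });
    }
    let mut out = HashMap::new();
    while let Some((key, res)) = futs.next().await {
        out.insert(key, res);
    }
    out
}
```

The real code additionally emits a rate-limited warning if a `ValuesReconstructState` is dropped while IOs are still in flight; that detail is omitted here.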
Testing / Benchmarking:

* Support queue-depth in pagebench for manual benchmarking.
* Add test suite support for setting the concurrency-mode pageserver config field via a) an env var and b) `NeonEnvBuilder`.
* Hacky helper to have sidecar-based `IoConcurrency` in tests. This will be cleaned up later.

More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod. Some intermediate helpers for manual benchmarking have been preserved in https://github.com/neondatabase/neon/pull/10466 and will be landed in later PRs. (L0 layer stack generator!)

Drive-By:

* The test suite actually didn't enable batching by default because `config.compatibility_neon_binpath` is always truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
* Initial logical size calculation wasn't always polled to completion, which was surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has in-flight IOs.
* Remove the timing histograms `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds`, because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands.
* Remove code related to `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105.

## Testing

Unit tests use the sidecar task by default and run both modes in CI. Python regression tests and benchmarks also use the sidecar task by default. We'll test more in staging and possibly preprod.

## Future Work

Please refer to the parent epic for the full plan. The next step will be to fold the plumbing of `IoConcurrency` into `RequestContext` so that the function signatures get cleaned up. Once `Sequential` isn't used anymore, we can take the next big leap, which is replacing the opaque IOs with structs that have well-defined semantics.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
297 lines
9.8 KiB
Rust
```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    time::Duration,
};

/// Gates are a concurrency helper, primarily used for implementing safe shutdown.
///
/// Users of a resource call `enter()` to acquire a GateGuard, and the owner of
/// the resource calls `close()` when they want to ensure that all holders of guards
/// have released them, and that no future guards will be issued.
pub struct Gate {
    inner: Arc<GateInner>,
}

impl std::fmt::Debug for Gate {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Gate")
            // use this for identification
            .field("ptr", &Arc::as_ptr(&self.inner))
            .field("inner", &self.inner)
            .finish()
    }
}

struct GateInner {
    sem: tokio::sync::Semaphore,
    closing: std::sync::atomic::AtomicBool,
}

impl std::fmt::Debug for GateInner {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let avail = self.sem.available_permits();

        let guards = u32::try_from(avail)
            .ok()
            // the sem only supports a 32-bit-ish amount, but let's play it safe
            .and_then(|x| Gate::MAX_UNITS.checked_sub(x));

        let closing = self.closing.load(Ordering::Relaxed);

        if let Some(guards) = guards {
            f.debug_struct("Gate")
                .field("remaining_guards", &guards)
                .field("closing", &closing)
                .finish()
        } else {
            f.debug_struct("Gate")
                .field("avail_permits", &avail)
                .field("closing", &closing)
                .finish()
        }
    }
}

/// RAII guard for a [`Gate`]: as long as this exists, calls to [`Gate::close`] will
/// not complete.
#[derive(Debug)]
pub struct GateGuard {
    // Record the span where the gate was entered, so that we can identify who was blocking Gate::close
    span_at_enter: tracing::Span,
    gate: Arc<GateInner>,
}

impl GateGuard {
    pub fn try_clone(&self) -> Result<Self, GateError> {
        Gate::enter_impl(self.gate.clone())
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        if self.gate.closing.load(Ordering::Relaxed) {
            self.span_at_enter.in_scope(
                || tracing::info!(gate = ?Arc::as_ptr(&self.gate), "kept the gate from closing"),
            );
        }

        // when the permit was acquired, it was forgotten to allow us to manage its lifecycle
        // manually, so "return" the permit now.
        self.gate.sem.add_permits(1);
    }
}

#[derive(Debug, thiserror::Error)]
pub enum GateError {
    #[error("gate is closed")]
    GateClosed,
}

impl Default for Gate {
    fn default() -> Self {
        Self {
            inner: Arc::new(GateInner {
                sem: tokio::sync::Semaphore::new(Self::MAX_UNITS as usize),
                closing: AtomicBool::new(false),
            }),
        }
    }
}

impl Gate {
    const MAX_UNITS: u32 = u32::MAX;

    /// Acquire a guard that will prevent close() calls from completing. If close()
    /// was already called, this will return an error which should be interpreted
    /// as "shutting down".
    ///
    /// This function would typically be used from e.g. request handlers. While holding
    /// the guard returned from this function, it is important to respect a CancellationToken
    /// to avoid blocking close() indefinitely: typically types that contain a Gate will
    /// also contain a CancellationToken.
    pub fn enter(&self) -> Result<GateGuard, GateError> {
        Self::enter_impl(self.inner.clone())
    }

    fn enter_impl(gate: Arc<GateInner>) -> Result<GateGuard, GateError> {
        let permit = gate.sem.try_acquire().map_err(|_| GateError::GateClosed)?;

        // we now have the permit, let's disable the normal raii functionality and leave
        // "returning" the permit to our GateGuard::drop.
        //
        // this is done to avoid the need for multiple Arcs (one for semaphore, next for other
        // fields).
        permit.forget();

        Ok(GateGuard {
            span_at_enter: tracing::Span::current(),
            gate,
        })
    }

    /// Types with a shutdown() method and a gate should call this method at the
    /// end of shutdown, to ensure that all GateGuard holders are done.
    ///
    /// This will wait for all guards to be destroyed. For this to complete promptly, it is
    /// important that the holders of such guards are respecting a CancellationToken which has
    /// been cancelled before entering this function.
    pub async fn close(&self) {
        let started_at = std::time::Instant::now();
        let mut do_close = std::pin::pin!(self.do_close());

        // with 1s we rarely saw anything, let's try if we get more gate closing reasons with 100ms
        let nag_after = Duration::from_millis(100);

        let Err(_timeout) = tokio::time::timeout(nag_after, &mut do_close).await else {
            return;
        };

        tracing::info!(
            gate = ?self.as_ptr(),
            elapsed_ms = started_at.elapsed().as_millis(),
            "closing is taking longer than expected"
        );

        // close operation is not trying to be cancellation safe as pageserver does not need it.
        //
        // note: "closing" is not checked in Gate::enter -- it exists just for observability,
        // dropping of GateGuard after this will log who they were.
        self.inner.closing.store(true, Ordering::Relaxed);

        do_close.await;

        tracing::info!(
            gate = ?self.as_ptr(),
            elapsed_ms = started_at.elapsed().as_millis(),
            "close completed"
        );
    }

    /// Used as an identity of a gate. This identity will be resolved to something useful when
    /// it's actually closed in a hopefully sensible `tracing::Span` which will describe it even
    /// more.
    ///
    /// `GateGuard::drop` also logs this pointer when it has realized it has been keeping the gate
    /// open for too long.
    fn as_ptr(&self) -> *const GateInner {
        Arc::as_ptr(&self.inner)
    }

    /// Check if [`Self::close()`] has finished waiting for all [`Self::enter()`] users to finish. This
    /// is usually analogous to "Did shutdown finish?" for types that include a Gate, whereas checking
    /// the CancellationToken on such types is analogous to "Did shutdown start?"
    pub fn close_complete(&self) -> bool {
        self.inner.sem.is_closed()
    }

    #[tracing::instrument(level = tracing::Level::DEBUG, skip_all, fields(gate = ?self.as_ptr()))]
    async fn do_close(&self) {
        tracing::debug!("Closing Gate...");

        match self.inner.sem.acquire_many(Self::MAX_UNITS).await {
            Ok(_permit) => {
                // While holding all units, close the semaphore. All subsequent calls to enter() will fail.
                self.inner.sem.close();
            }
            Err(_closed) => {
                // Semaphore closed: we are the only function that can do this, so it indicates a double-call.
                // This is legal. Timeline::shutdown for example is not protected from being called more than
                // once.
                tracing::debug!("Double close")
            }
        }
        tracing::debug!("Closed Gate.")
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn close_unused() {
        // Having taken no guards, we should not be blocked in close
        let gate = Gate::default();
        gate.close().await;
    }

    #[tokio::test]
    async fn close_idle() {
        // If a guard is dropped before close() is called, close should not be blocked
        let gate = Gate::default();
        let guard = gate.enter().unwrap();
        drop(guard);
        gate.close().await;

        // Entering a closed gate fails
        gate.enter().expect_err("enter should fail after close");
    }

    #[tokio::test(start_paused = true)]
    async fn close_busy_gate() {
        let gate = Gate::default();
        let forever = Duration::from_secs(24 * 7 * 365);

        let guard =
            tracing::info_span!("i am holding back the gate").in_scope(|| gate.enter().unwrap());

        let mut close_fut = std::pin::pin!(gate.close());

        // Close should be waiting for guards to drop
        tokio::time::timeout(forever, &mut close_fut)
            .await
            .unwrap_err();

        // Attempting to enter() should fail, even though close isn't done yet.
        gate.enter()
            .expect_err("enter should fail after entering close");

        // this will now log, which we cannot verify except manually
        drop(guard);

        // Guard is gone, close should finish
        close_fut.await;

        // Attempting to enter() is still forbidden
        gate.enter().expect_err("enter should fail after finishing close");
    }

    #[tokio::test(start_paused = true)]
    async fn clone_gate_guard() {
        let gate = Gate::default();
        let forever = Duration::from_secs(24 * 7 * 365);

        let guard1 = gate.enter().expect("gate isn't closed");

        let guard2 = guard1.try_clone().expect("gate isn't closed");

        let mut close_fut = std::pin::pin!(gate.close());

        tokio::time::timeout(forever, &mut close_fut)
            .await
            .unwrap_err();

        // we polled close_fut once, that should prevent all later enters and clones
        gate.enter().unwrap_err();
        guard1.try_clone().unwrap_err();
        guard2.try_clone().unwrap_err();

        // guard2 keeps the gate open even if guard1 is dropped
        drop(guard1);
        tokio::time::timeout(forever, &mut close_fut)
            .await
            .unwrap_err();

        drop(guard2);

        // now that the last guard is dropped, closing should complete
        close_fut.await;

        // entering is still forbidden
        gate.enter().expect_err("enter should still fail");
    }
}
```
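To tie the PR's "Make `GateGuard` clonable" prep work back to this file, here is a hedged usage sketch (not part of the source above): a type that owns a `Gate` plus a `CancellationToken`, hands a cloned guard to spawned background work, and shuts down by cancelling first and then closing the gate. The `utils::sync::gate` import path and the `Service` type are assumptions for illustration.

```rust
// Hypothetical usage sketch of the Gate API shown above.
use tokio_util::sync::CancellationToken;
use utils::sync::gate::{Gate, GateError}; // assumed path of the module above

struct Service {
    gate: Gate,
    cancel: CancellationToken,
}

impl Service {
    async fn handle_request(&self) -> Result<(), GateError> {
        // Entering fails once close() has started: interpret as "shutting down".
        let guard = self.gate.enter()?;

        // Give background work its own guard so the gate stays open until that
        // work is done too (this is what `GateGuard::try_clone` enables).
        let bg_guard = guard.try_clone()?;
        let cancel = self.cancel.clone();
        tokio::spawn(async move {
            let _bg_guard = bg_guard; // held for the duration of the work
            tokio::select! {
                _ = cancel.cancelled() => {} // respect shutdown
                _ = tokio::time::sleep(std::time::Duration::from_millis(10)) => {} // the "IO"
            }
        });

        // ... request work while holding `guard`, checking `self.cancel` ...
        drop(guard);
        Ok(())
    }

    async fn shutdown(&self) {
        // Cancel first so guard holders stop promptly, then wait for them.
        self.cancel.cancel();
        self.gate.close().await;
        assert!(self.gate.close_complete());
    }
}
```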