fix(pageserver): use different walredo retry setting for gc-compaction (#11497)

## Problem

Not a complete fix for https://github.com/neondatabase/neon/issues/11492,
but it should work in the short term.

Our current retry strategy for walredo is to retry every request exactly
once. This doesn't make sense because walredo errors are correlated: each
error is expected to restart the process and cause the other in-flight
requests on that process to fail as well. I'll explain with a scenario of
two threads requesting redos: one with an invalid history (which will make
walredo panic) and another with a correct redo sequence.

First, let's look at how retries are handled right now in
`do_with_walredo_process` (a simplified sketch follows the diagram below).
At the beginning of the function it spawns a new process if there is no
existing one, then it proceeds with the redo. If the process fails, the
first request that encounters the error removes the walredo process object
from the OnceCell, so that the next time it is accessed a new process is
spawned; the last request still using the old walredo process kills it and
waits for it in `drop(proc)`. I'm skeptical whether this works under races,
but I don't think that is the root cause of the problem. With this retry
handler, if there are N requests attached to a walredo process and the i-th
request fails (panicking walredo), the other N-i requests will all fail and
have to retry so that they can access a new walredo process.

```
time       ---->
proc        A                 None   B
request 1   ^-----------------^ fail
            uses A for redo   replace with None
request 2      ^-------------------- fail
               uses A for redo
request 3             ^----------------^ fail
                      uses A for redo  last ref, wait for A to be killed
request 4                            ^---------------
                                     None, spawn new process B
```
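
For reference, here is a minimal, self-contained sketch of the process-replacement flow described above. It is an illustration only: `RedoManager`, `with_process`, and `Process` are made-up names, and the real pageserver code uses a once-cell and async I/O rather than a plain mutex.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the external walredo child process.
struct Process {
    pid: u32,
}

impl Drop for Process {
    fn drop(&mut self) {
        // The last holder of the Arc ends up here: this is where the real
        // code kills the child process and waits for it to exit.
        println!("killing and waiting for walredo process {}", self.pid);
    }
}

struct RedoManager {
    /// `None` means "no live walredo process; spawn one on next use".
    process: Mutex<Option<Arc<Process>>>,
}

impl RedoManager {
    fn with_process<T>(
        &self,
        work: impl FnOnce(&Process) -> Result<T, String>,
    ) -> Result<T, String> {
        // Spawn a process if there is no existing one, then share it.
        let proc = {
            let mut slot = self.process.lock().unwrap();
            slot.get_or_insert_with(|| Arc::new(Process { pid: 42 })).clone()
        };

        match work(&proc) {
            Ok(value) => Ok(value),
            Err(err) => {
                // The first request that observes the error clears the slot,
                // so the next access spawns a fresh process.
                let mut slot = self.process.lock().unwrap();
                if slot.as_ref().is_some_and(|p| Arc::ptr_eq(p, &proc)) {
                    *slot = None;
                }
                drop(slot);
                // If this was the last reference to the old process, dropping
                // it kills and waits for it (see `Drop` above).
                drop(proc);
                Err(err)
            }
        }
    }
}
```

With N concurrent requests sharing one process handle, the i-th failure clears the slot; the other N-i requests fail on the dead process, and each one retries by calling this again, which is the sequencing shown in the diagram.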

The problem is our retry strategy. Retrying makes sense for systems where
the error probabilities of individual requests are uncorrelated. In
walredo, however, a prior request that panics the walredo process causes
every subsequent redo on that process to fail, so the errors are correlated.

So, back to the situation where we have two requests, one that will
definitely fail and one that would succeed. With retry attempts = 1, we get
the following sequence:

* New walredo process A starts.
* Request 1 (invalid) is processed on A and panics A; it removes process A
from the process object and waits to retry.
* Request 2 (valid) is processed on A and receives a broken-pipe /
poisoned-process error; it waits for A to be killed before retrying --
this very likely takes a while and cannot finish before request 1 gets
processed again.
* New walredo process B starts.
* Request 1 (invalid) is processed again on B and panics B; the whole
request fails.
* Request 2 (valid) is processed again on B and gets a poisoned error again.

```
time       ---->
proc        A                 None           B                    None
request 1   ^-----------------^--------------^--------------------^
            spawn A for redo  fail          spawn B for redo     fail
request 2      ^--------------------^-------------------------^------------^
               use A for redo       fail, wait to kill A      B for redo   fail again
```

In such cases, no matter how we set n_attempts, as long as the same retry
count applies to all requests, this sequencing is bound to fail both
requests, even though request 2 could have succeeded.

There are many possible solutions -- such as having a separate walredo
manager for compactions, defining which errors are retryable (e.g., a
broken pipe can be retried, while a genuine walredo error cannot), or
taking one exclusive big lock over the whole redo process (the current
locking is very fine-grained). In this patch, we go with a simple approach:
use a different number of retry attempts for different types of requests.

For gc-compaction, the attempt count is set to 0, so it never retries; the
first failure stops the gc-compaction job and no more redos are issued from
it. Once the walredo process is restarted, normal read requests proceed as
usual.
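
Concretely, the per-request-type behavior introduced by this patch is the following (condensed from the walredo.rs hunk in the diff below; the real code inlines the `match` in `request_redo` instead of using a helper function):

```rust
/// Condensed from the walredo.rs hunk in the diff: the new request types.
#[derive(Debug, Clone, Copy)]
pub enum RedoAttemptType {
    /// Read path: fires critical errors and retries twice on failure.
    ReadPage,
    /// Legacy (image) compaction: fires critical errors and retries once.
    LegacyCompaction,
    /// Gc-compaction: fires no critical errors and never retries.
    GcCompaction,
}

/// Retry budget per request type, matching the `match` added in
/// `PostgresRedoManager::request_redo`.
pub fn max_retry_attempts(attempt_type: RedoAttemptType) -> u32 {
    match attempt_type {
        RedoAttemptType::ReadPage => 2,
        RedoAttemptType::LegacyCompaction => 1,
        RedoAttemptType::GcCompaction => 0,
    }
}
```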

## Summary of changes

Add a `redo_attempt_type` parameter to each reconstruct-value request so
that different retry policies can be applied.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>

Author: Alex Chi Z.
Date: 2025-04-09 14:01:31 -04:00, committed by GitHub
Commit: ec66b788e2 (parent: af12647b9d)
5 changed files with 66 additions and 36 deletions


@@ -65,7 +65,7 @@ use bytes::{Buf, Bytes};
use criterion::{BenchmarkId, Criterion};
use once_cell::sync::Lazy;
use pageserver::config::PageServerConf;
use pageserver::walredo::PostgresRedoManager;
use pageserver::walredo::{PostgresRedoManager, RedoAttemptType};
use pageserver_api::key::Key;
use pageserver_api::record::NeonWalRecord;
use pageserver_api::shard::TenantShardId;
@@ -223,7 +223,14 @@ impl Request {
// TODO: avoid these clones
manager
.request_redo(*key, *lsn, base_img.clone(), records.clone(), *pg_version)
.request_redo(
*key,
*lsn,
base_img.clone(),
records.clone(),
*pg_version,
RedoAttemptType::ReadPage,
)
.await
.context("request_redo")
}


@@ -100,7 +100,7 @@ use crate::tenant::timeline::delete::DeleteTimelineFlow;
use crate::tenant::timeline::uninit::cleanup_timeline_directory;
use crate::virtual_file::VirtualFile;
use crate::walingest::WalLagCooldown;
use crate::walredo::PostgresRedoManager;
use crate::walredo::{PostgresRedoManager, RedoAttemptType};
use crate::{InitializationOrder, TEMP_FILE_SUFFIX, import_datadir, span, task_mgr, walredo};
static INIT_DB_SEMAPHORE: Lazy<Semaphore> = Lazy::new(|| Semaphore::new(8));
@@ -473,15 +473,16 @@ impl WalRedoManager {
base_img: Option<(Lsn, bytes::Bytes)>,
records: Vec<(Lsn, pageserver_api::record::NeonWalRecord)>,
pg_version: u32,
redo_attempt_type: RedoAttemptType,
) -> Result<bytes::Bytes, walredo::Error> {
match self {
Self::Prod(_, mgr) => {
mgr.request_redo(key, lsn, base_img, records, pg_version)
mgr.request_redo(key, lsn, base_img, records, pg_version, redo_attempt_type)
.await
}
#[cfg(test)]
Self::Test(mgr) => {
mgr.request_redo(key, lsn, base_img, records, pg_version)
mgr.request_redo(key, lsn, base_img, records, pg_version, redo_attempt_type)
.await
}
}
@@ -5879,6 +5880,7 @@ pub(crate) mod harness {
base_img: Option<(Lsn, Bytes)>,
records: Vec<(Lsn, NeonWalRecord)>,
_pg_version: u32,
_redo_attempt_type: RedoAttemptType,
) -> Result<Bytes, walredo::Error> {
let records_neon = records.iter().all(|r| apply_neon::can_apply_in_neon(&r.1));
if records_neon {


@@ -24,6 +24,7 @@ use std::sync::{Arc, Mutex, OnceLock, RwLock, Weak};
use std::time::{Duration, Instant, SystemTime};
use crate::PERF_TRACE_TARGET;
use crate::walredo::RedoAttemptType;
use anyhow::{Context, Result, anyhow, bail, ensure};
use arc_swap::{ArcSwap, ArcSwapOption};
use bytes::Bytes;
@@ -1293,6 +1294,12 @@ impl Timeline {
};
reconstruct_state.read_path = read_path;
let redo_attempt_type = if ctx.task_kind() == TaskKind::Compaction {
RedoAttemptType::LegacyCompaction
} else {
RedoAttemptType::ReadPage
};
let traversal_res: Result<(), _> = {
let ctx = RequestContextBuilder::from(ctx)
.perf_span(|crnt_perf_span| {
@@ -1380,7 +1387,7 @@ impl Timeline {
let walredo_deltas = converted.num_deltas();
let walredo_res = walredo_self
.reconstruct_value(key, lsn, converted)
.reconstruct_value(key, lsn, converted, redo_attempt_type)
.maybe_perf_instrument(&ctx, |crnt_perf_span| {
info_span!(
target: PERF_TRACE_TARGET,
@@ -6343,37 +6350,21 @@ impl Timeline {
/// Reconstruct a value, using the given base image and WAL records in 'data'.
async fn reconstruct_value(
&self,
key: Key,
request_lsn: Lsn,
data: ValueReconstructState,
) -> Result<Bytes, PageReconstructError> {
self.reconstruct_value_inner(key, request_lsn, data, false)
.await
}
/// Reconstruct a value, using the given base image and WAL records in 'data'. It does not fire critical errors because
/// sometimes it is expected to fail due to unreplayable history described in <https://github.com/neondatabase/neon/issues/10395>.
async fn reconstruct_value_wo_critical_error(
&self,
key: Key,
request_lsn: Lsn,
data: ValueReconstructState,
) -> Result<Bytes, PageReconstructError> {
self.reconstruct_value_inner(key, request_lsn, data, true)
.await
}
async fn reconstruct_value_inner(
&self,
key: Key,
request_lsn: Lsn,
mut data: ValueReconstructState,
no_critical_error: bool,
redo_attempt_type: RedoAttemptType,
) -> Result<Bytes, PageReconstructError> {
// Perform WAL redo if needed
data.records.reverse();
let fire_critical_error = match redo_attempt_type {
RedoAttemptType::ReadPage => true,
RedoAttemptType::LegacyCompaction => true,
RedoAttemptType::GcCompaction => false,
};
// If we have a page image, and no WAL, we're all set
if data.records.is_empty() {
if let Some((img_lsn, img)) = &data.img {
@@ -6420,13 +6411,20 @@ impl Timeline {
.as_ref()
.context("timeline has no walredo manager")
.map_err(PageReconstructError::WalRedo)?
.request_redo(key, request_lsn, data.img, data.records, self.pg_version)
.request_redo(
key,
request_lsn,
data.img,
data.records,
self.pg_version,
redo_attempt_type,
)
.await;
let img = match res {
Ok(img) => img,
Err(walredo::Error::Cancelled) => return Err(PageReconstructError::Cancelled),
Err(walredo::Error::Other(err)) => {
if !no_critical_error {
if fire_critical_error {
critical!("walredo failure during page reconstruction: {err:?}");
}
return Err(PageReconstructError::WalRedo(


@@ -16,6 +16,8 @@ use super::{
Timeline,
};
use crate::tenant::timeline::DeltaEntry;
use crate::walredo::RedoAttemptType;
use anyhow::{Context, anyhow};
use bytes::Bytes;
use enumset::EnumSet;
@@ -2411,7 +2413,7 @@ impl Timeline {
lsn_split_points[i]
};
let img = self
.reconstruct_value_wo_critical_error(key, request_lsn, state)
.reconstruct_value(key, request_lsn, state, RedoAttemptType::GcCompaction)
.await?;
Some((request_lsn, img))
} else {
@@ -3909,8 +3911,6 @@ impl CompactionLayer<Key> for OwnArc<DeltaLayer> {
}
}
use crate::tenant::timeline::DeltaEntry;
impl CompactionLayer<Key> for ResidentDeltaLayer {
fn key_range(&self) -> &Range<Key> {
&self.0.layer_desc().key_range


@@ -136,6 +136,16 @@ macro_rules! bail {
}
}
#[derive(Debug, Clone, Copy)]
pub enum RedoAttemptType {
/// Used for the read path. Will fire critical errors and retry twice if failure.
ReadPage,
// Used for legacy compaction (only used in image compaction). Will fire critical errors and retry once if failure.
LegacyCompaction,
// Used for gc compaction. Will not fire critical errors and not retry.
GcCompaction,
}
///
/// Public interface of WAL redo manager
///
@@ -156,11 +166,18 @@ impl PostgresRedoManager {
base_img: Option<(Lsn, Bytes)>,
records: Vec<(Lsn, NeonWalRecord)>,
pg_version: u32,
redo_attempt_type: RedoAttemptType,
) -> Result<Bytes, Error> {
if records.is_empty() {
bail!("invalid WAL redo request with no records");
}
let max_retry_attempts = match redo_attempt_type {
RedoAttemptType::ReadPage => 2,
RedoAttemptType::LegacyCompaction => 1,
RedoAttemptType::GcCompaction => 0,
};
let base_img_lsn = base_img.as_ref().map(|p| p.0).unwrap_or(Lsn::INVALID);
let mut img = base_img.map(|p| p.1);
let mut batch_neon = apply_neon::can_apply_in_neon(&records[0].1);
@@ -180,6 +197,7 @@ impl PostgresRedoManager {
&records[batch_start..i],
self.conf.wal_redo_timeout,
pg_version,
max_retry_attempts,
)
.await
};
@@ -201,6 +219,7 @@ impl PostgresRedoManager {
&records[batch_start..],
self.conf.wal_redo_timeout,
pg_version,
max_retry_attempts,
)
.await
}
@@ -424,11 +443,11 @@ impl PostgresRedoManager {
records: &[(Lsn, NeonWalRecord)],
wal_redo_timeout: Duration,
pg_version: u32,
max_retry_attempts: u32,
) -> Result<Bytes, Error> {
*(self.last_redo_at.lock().unwrap()) = Some(Instant::now());
let (rel, blknum) = key.to_rel_block().context("invalid record")?;
const MAX_RETRY_ATTEMPTS: u32 = 1;
let mut n_attempts = 0u32;
loop {
let base_img = &base_img;
@@ -486,7 +505,7 @@ impl PostgresRedoManager {
info!(n_attempts, "retried walredo succeeded");
}
n_attempts += 1;
if n_attempts > MAX_RETRY_ATTEMPTS || result.is_ok() {
if n_attempts > max_retry_attempts || result.is_ok() {
return result;
}
}
@@ -560,6 +579,7 @@ mod tests {
use super::PostgresRedoManager;
use crate::config::PageServerConf;
use crate::walredo::RedoAttemptType;
#[tokio::test]
async fn test_ping() {
@@ -593,6 +613,7 @@ mod tests {
None,
short_records(),
14,
RedoAttemptType::ReadPage,
)
.instrument(h.span())
.await
@@ -621,6 +642,7 @@ mod tests {
None,
short_records(),
14,
RedoAttemptType::ReadPage,
)
.instrument(h.span())
.await
@@ -642,6 +664,7 @@ mod tests {
None,
short_records(),
16, /* 16 currently produces stderr output on startup, which adds a nice extra edge */
RedoAttemptType::ReadPage,
)
.instrument(h.span())
.await