workflows: only run OLTP reuse_branch for shared_buffers testing

Skip hole tags in local_cache view (#11454 )
## Problem If the local file cache is shrunk, so that we punch some holes in the underlying file, the local_cache view displays the holes incorrectly. See https://github.com/neondatabase/neon/issues/10770 ## Summary of changes Skip hole tags in the local_cache view. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2026-05-17 21:20:37 +00:00 · 2025-04-08 11:36:45 +02:00 · 2025-04-08 03:52:50 +00:00 · 2025-04-07 21:19:06 +00:00 · 2025-04-07 19:10:36 +00:00 · 2025-04-07 17:56:56 +00:00
21 changed files with 395 additions and 190 deletions
--- a/.github/workflows/large_oltp_benchmark.yml
+++ b/.github/workflows/large_oltp_benchmark.yml
@@ -33,8 +33,8 @@ jobs:
      fail-fast: false # allow other variants to continue even if one fails
      matrix:
        include:
-          - target: new_branch 
-            custom_scripts: insert_webhooks.sql@200 select_any_webhook_with_skew.sql@300 select_recent_webhook.sql@397 select_prefetch_webhook.sql@3 IUD_one_transaction.sql@100
+          #- target: new_branch 
+          #  custom_scripts: insert_webhooks.sql@200 select_any_webhook_with_skew.sql@300 select_recent_webhook.sql@397 select_prefetch_webhook.sql@3 IUD_one_transaction.sql@100
          - target: reuse_branch 
            custom_scripts: insert_webhooks.sql@200 select_any_webhook_with_skew.sql@300 select_recent_webhook.sql@397 select_prefetch_webhook.sql@3 IUD_one_transaction.sql@100
      max-parallel: 1 # we want to run each stripe size sequentially to be able to compare the results
--- a/libs/pageserver_api/src/models.rs
+++ b/libs/pageserver_api/src/models.rs
@@ -80,10 +80,22 @@ pub enum TenantState {
    ///
    /// Transitions out of this state are possible through `set_broken()`.
    Stopping {
+        /// The barrier can be used to wait for shutdown to complete. The first caller to set
+        /// Some(Barrier) is responsible for driving shutdown to completion. Subsequent callers
+        /// will wait for the first caller's existing barrier.
+        ///
+        /// None is set when an attach is cancelled, to signal to shutdown that the attach has in
+        /// fact cancelled:
+        ///
+        /// 1. `shutdown` sees `TenantState::Attaching`, and cancels the tenant.
+        /// 2. `attach` sets `TenantState::Stopping(None)` and exits.
+        /// 3. `set_stopping` waits for `TenantState::Stopping(None)` and sets
+        ///    `TenantState::Stopping(Some)` to claim the barrier as the shutdown owner.
+        //
        // Because of https://github.com/serde-rs/serde/issues/2105 this has to be a named field,
        // otherwise it will not be skipped during deserialization
        #[serde(skip)]
-        progress: completion::Barrier,
+        progress: Option<completion::Barrier>,
    },
    /// The tenant is recognized by the pageserver, but can no longer be used for
    /// any operations.
@@ -2719,10 +2731,15 @@ mod tests {
                "Activating",
            ),
            (line!(), TenantState::Active, "Active"),
+            (
+                line!(),
+                TenantState::Stopping { progress: None },
+                "Stopping",
+            ),
            (
                line!(),
                TenantState::Stopping {
-                    progress: utils::completion::Barrier::default(),
+                    progress: Some(completion::Barrier::default()),
                },
                "Stopping",
            ),
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -45,6 +45,7 @@ use remote_timeline_client::manifest::{
 };
 use remote_timeline_client::{
    FAILED_REMOTE_OP_RETRIES, FAILED_UPLOAD_WARN_THRESHOLD, UploadQueueNotReadyError,
+    download_tenant_manifest,
 };
 use secondary::heatmap::{HeatMapTenant, HeatMapTimeline};
 use storage_broker::BrokerClientChannel;
@@ -226,7 +227,8 @@ struct TimelinePreload {
 }

 pub(crate) struct TenantPreload {
-    tenant_manifest: TenantManifest,
+    /// The tenant manifest from remote storage, or None if no manifest was found.
+    tenant_manifest: Option<TenantManifest>,
    /// Map from timeline ID to a possible timeline preload. It is None iff the timeline is offloaded according to the manifest.
    timelines: HashMap<TimelineId, Option<TimelinePreload>>,
 }
@@ -282,12 +284,15 @@ pub struct Tenant {
    /// **Lock order**: if acquiring all (or a subset), acquire them in order `timelines`, `timelines_offloaded`, `timelines_creating`
    timelines_offloaded: Mutex<HashMap<TimelineId, Arc<OffloadedTimeline>>>,

-    /// Serialize writes of the tenant manifest to remote storage.  If there are concurrent operations
-    /// affecting the manifest, such as timeline deletion and timeline offload, they must wait for
-    /// each other (this could be optimized to coalesce writes if necessary).
+    /// The last tenant manifest known to be in remote storage. None if the manifest has not yet
+    /// been either downloaded or uploaded. Always Some after tenant attach.
    ///
-    /// The contents of the Mutex are the last manifest we successfully uploaded
-    tenant_manifest_upload: tokio::sync::Mutex<Option<TenantManifest>>,
+    /// Initially populated during tenant attach, updated via `maybe_upload_tenant_manifest`.
+    ///
+    /// Do not modify this directly. It is used to check whether a new manifest needs to be
+    /// uploaded. The manifest is constructed in `build_tenant_manifest`, and uploaded via
+    /// `maybe_upload_tenant_manifest`.
+    remote_tenant_manifest: tokio::sync::Mutex<Option<TenantManifest>>,

    // This mutex prevents creation of new timelines during GC.
    // Adding yet another mutex (in addition to `timelines`) is needed because holding
@@ -1354,36 +1359,41 @@ impl Tenant {
                    }
                }

-                // Ideally we should use Tenant::set_broken_no_wait, but it is not supposed to be used when tenant is in loading state.
-                enum BrokenVerbosity {
-                    Error,
-                    Info
-                }
-                let make_broken =
-                    |t: &Tenant, err: anyhow::Error, verbosity: BrokenVerbosity| {
-                        match verbosity {
-                            BrokenVerbosity::Info => {
-                                info!("attach cancelled, setting tenant state to Broken: {err}");
-                            },
-                            BrokenVerbosity::Error => {
-                                error!("attach failed, setting tenant state to Broken: {err:?}");
-                            }
+                fn make_broken_or_stopping(t: &Tenant, err: anyhow::Error) {
+                    t.state.send_modify(|state| match state {
+                        // TODO: the old code alluded to DeleteTenantFlow sometimes setting
+                        // TenantState::Stopping before we get here, but this may be outdated.
+                        // Let's find out with a testing assertion. If this doesn't fire, and the
+                        // logs don't show this happening in production, remove the Stopping cases.
+                        TenantState::Stopping{..} if cfg!(any(test, feature = "testing")) => {
+                            panic!("unexpected TenantState::Stopping during attach")
                        }
-                        t.state.send_modify(|state| {
-                            // The Stopping case is for when we have passed control on to DeleteTenantFlow:
-                            // if it errors, we will call make_broken when tenant is already in Stopping.
-                            assert!(
-                                matches!(*state, TenantState::Attaching | TenantState::Stopping { .. }),
-                                "the attach task owns the tenant state until activation is complete"
-                            );
-
-                            *state = TenantState::broken_from_reason(err.to_string());
-                        });
-                    };
+                        // If the tenant is cancelled, assume the error was caused by cancellation.
+                        TenantState::Attaching if t.cancel.is_cancelled() => {
+                            info!("attach cancelled, setting tenant state to Stopping: {err}");
+                            // NB: progress None tells `set_stopping` that attach has cancelled.
+                            *state = TenantState::Stopping { progress: None };
+                        }
+                        // According to the old code, DeleteTenantFlow may already have set this to
+                        // Stopping. Retain its progress.
+                        // TODO: there is no DeleteTenantFlow. Is this still needed? See above.
+                        TenantState::Stopping { progress } if t.cancel.is_cancelled() => {
+                            assert!(progress.is_some(), "concurrent attach cancellation");
+                            info!("attach cancelled, already Stopping: {err}");
+                        }
+                        // Mark the tenant as broken.
+                        TenantState::Attaching | TenantState::Stopping { .. } => {
+                            error!("attach failed, setting tenant state to Broken (was {state}): {err:?}");
+                            *state = TenantState::broken_from_reason(err.to_string())
+                        }
+                        // The attach task owns the tenant state until activated.
+                        state => panic!("invalid tenant state {state} during attach: {err:?}"),
+                    });
+                }

                // TODO: should also be rejecting tenant conf changes that violate this check.
                if let Err(e) = crate::tenant::storage_layer::inmemory_layer::IndexEntry::validate_checkpoint_distance(tenant_clone.get_checkpoint_distance()) {
-                    make_broken(&tenant_clone, anyhow::anyhow!(e), BrokenVerbosity::Error);
+                    make_broken_or_stopping(&tenant_clone, anyhow::anyhow!(e));
                    return Ok(());
                }

@@ -1435,10 +1445,8 @@ impl Tenant {
                            // stayed in Activating for such a long time that shutdown found it in
                            // that state.
                            tracing::info!(state=%tenant_clone.current_state(), "Tenant shut down before activation");
-                            // Make the tenant broken so that set_stopping will not hang waiting for it to leave
-                            // the Attaching state.  This is an over-reaction (nothing really broke, the tenant is
-                            // just shutting down), but ensures progress.
-                            make_broken(&tenant_clone, anyhow::anyhow!("Shut down while Attaching"), BrokenVerbosity::Info);
+                            // Set the tenant to Stopping to signal `set_stopping` that we're done.
+                            make_broken_or_stopping(&tenant_clone, anyhow::anyhow!("Shut down while Attaching"));
                            return Ok(());
                        },
                    )
@@ -1457,7 +1465,7 @@ impl Tenant {
                        match res {
                            Ok(p) => Some(p),
                            Err(e) => {
-                                make_broken(&tenant_clone, anyhow::anyhow!(e), BrokenVerbosity::Error);
+                                make_broken_or_stopping(&tenant_clone, anyhow::anyhow!(e));
                                return Ok(());
                            }
                        }
@@ -1483,9 +1491,7 @@ impl Tenant {
                        info!("attach finished, activating");
                        tenant_clone.activate(broker_client, None, &ctx);
                    }
-                    Err(e) => {
-                        make_broken(&tenant_clone, anyhow::anyhow!(e), BrokenVerbosity::Error);
-                    }
+                    Err(e) => make_broken_or_stopping(&tenant_clone, anyhow::anyhow!(e)),
                }

                // If we are doing an opportunistic warmup attachment at startup, initialize
@@ -1525,28 +1531,27 @@ impl Tenant {
            cancel.clone(),
        )
        .await?;
-        let (offloaded_add, tenant_manifest) =
-            match remote_timeline_client::download_tenant_manifest(
-                remote_storage,
-                &self.tenant_shard_id,
-                self.generation,
-                &cancel,
-            )
-            .await
-            {
-                Ok((tenant_manifest, _generation, _manifest_mtime)) => (
-                    format!("{} offloaded", tenant_manifest.offloaded_timelines.len()),
-                    tenant_manifest,
-                ),
-                Err(DownloadError::NotFound) => {
-                    ("no manifest".to_string(), TenantManifest::empty())
-                }
-                Err(e) => Err(e)?,
-            };
+
+        let tenant_manifest = match download_tenant_manifest(
+            remote_storage,
+            &self.tenant_shard_id,
+            self.generation,
+            &cancel,
+        )
+        .await
+        {
+            Ok((tenant_manifest, _, _)) => Some(tenant_manifest),
+            Err(DownloadError::NotFound) => None,
+            Err(err) => return Err(err.into()),
+        };

        info!(
-            "found {} timelines, and {offloaded_add}",
-            remote_timeline_ids.len()
+            "found {} timelines ({} offloaded timelines)",
+            remote_timeline_ids.len(),
+            tenant_manifest
+                .as_ref()
+                .map(|m| m.offloaded_timelines.len())
+                .unwrap_or(0)
        );

        for k in other_keys {
@@ -1555,11 +1560,13 @@ impl Tenant {

        // Avoid downloading IndexPart of offloaded timelines.
        let mut offloaded_with_prefix = HashSet::new();
-        for offloaded in tenant_manifest.offloaded_timelines.iter() {
-            if remote_timeline_ids.remove(&offloaded.timeline_id) {
-                offloaded_with_prefix.insert(offloaded.timeline_id);
-            } else {
-                // We'll take care later of timelines in the manifest without a prefix
+        if let Some(tenant_manifest) = &tenant_manifest {
+            for offloaded in tenant_manifest.offloaded_timelines.iter() {
+                if remote_timeline_ids.remove(&offloaded.timeline_id) {
+                    offloaded_with_prefix.insert(offloaded.timeline_id);
+                } else {
+                    // We'll take care later of timelines in the manifest without a prefix
+                }
            }
        }

@@ -1633,12 +1640,14 @@ impl Tenant {

        let mut offloaded_timeline_ids = HashSet::new();
        let mut offloaded_timelines_list = Vec::new();
-        for timeline_manifest in preload.tenant_manifest.offloaded_timelines.iter() {
-            let timeline_id = timeline_manifest.timeline_id;
-            let offloaded_timeline =
-                OffloadedTimeline::from_manifest(self.tenant_shard_id, timeline_manifest);
-            offloaded_timelines_list.push((timeline_id, Arc::new(offloaded_timeline)));
-            offloaded_timeline_ids.insert(timeline_id);
+        if let Some(tenant_manifest) = &preload.tenant_manifest {
+            for timeline_manifest in tenant_manifest.offloaded_timelines.iter() {
+                let timeline_id = timeline_manifest.timeline_id;
+                let offloaded_timeline =
+                    OffloadedTimeline::from_manifest(self.tenant_shard_id, timeline_manifest);
+                offloaded_timelines_list.push((timeline_id, Arc::new(offloaded_timeline)));
+                offloaded_timeline_ids.insert(timeline_id);
+            }
        }
        // Complete deletions for offloaded timeline id's from manifest.
        // The manifest will be uploaded later in this function.
@@ -1796,15 +1805,21 @@ impl Tenant {
            .context("resume_deletion")
            .map_err(LoadLocalTimelineError::ResumeDeletion)?;
        }
-        let needs_manifest_upload =
-            offloaded_timelines_list.len() != preload.tenant_manifest.offloaded_timelines.len();
        {
            let mut offloaded_timelines_accessor = self.timelines_offloaded.lock().unwrap();
            offloaded_timelines_accessor.extend(offloaded_timelines_list.into_iter());
        }
-        if needs_manifest_upload {
-            self.store_tenant_manifest().await?;
+
+        // Stash the preloaded tenant manifest, and upload a new manifest if changed.
+        //
+        // NB: this must happen after the tenant is fully populated above. In particular the
+        // offloaded timelines, which are included in the manifest.
+        {
+            let mut guard = self.remote_tenant_manifest.lock().await;
+            assert!(guard.is_none(), "tenant manifest set before preload"); // first populated here
+            *guard = preload.tenant_manifest;
        }
+        self.maybe_upload_tenant_manifest().await?;

        // The local filesystem contents are a cache of what's in the remote IndexPart;
        // IndexPart is the source of truth.
@@ -2218,7 +2233,7 @@ impl Tenant {
        };

        // Upload new list of offloaded timelines to S3
-        self.store_tenant_manifest().await?;
+        self.maybe_upload_tenant_manifest().await?;

        // Activate the timeline (if it makes sense)
        if !(timeline.is_broken() || timeline.is_stopping()) {
@@ -3429,7 +3444,7 @@ impl Tenant {
            shutdown_mode
        };

-        match self.set_stopping(shutdown_progress, false, false).await {
+        match self.set_stopping(shutdown_progress).await {
            Ok(()) => {}
            Err(SetStoppingError::Broken) => {
                // assume that this is acceptable
@@ -3509,25 +3524,13 @@ impl Tenant {
    /// This function waits for the tenant to become active if it isn't already, before transitioning it into Stopping state.
    ///
    /// This function is not cancel-safe!
-    ///
-    /// `allow_transition_from_loading` is needed for the special case of loading task deleting the tenant.
-    /// `allow_transition_from_attaching` is needed for the special case of attaching deleted tenant.
-    async fn set_stopping(
-        &self,
-        progress: completion::Barrier,
-        _allow_transition_from_loading: bool,
-        allow_transition_from_attaching: bool,
-    ) -> Result<(), SetStoppingError> {
+    async fn set_stopping(&self, progress: completion::Barrier) -> Result<(), SetStoppingError> {
        let mut rx = self.state.subscribe();

        // cannot stop before we're done activating, so wait out until we're done activating
        rx.wait_for(|state| match state {
-            TenantState::Attaching if allow_transition_from_attaching => true,
            TenantState::Activating(_) | TenantState::Attaching => {
-                info!(
-                    "waiting for {} to turn Active|Broken|Stopping",
-                    <&'static str>::from(state)
-                );
+                info!("waiting for {state} to turn Active|Broken|Stopping");
                false
            }
            TenantState::Active | TenantState::Broken { .. } | TenantState::Stopping { .. } => true,
@@ -3538,25 +3541,24 @@ impl Tenant {
        // we now know we're done activating, let's see whether this task is the winner to transition into Stopping
        let mut err = None;
        let stopping = self.state.send_if_modified(|current_state| match current_state {
-            TenantState::Activating(_) => {
-                unreachable!("1we ensured above that we're done with activation, and, there is no re-activation")
-            }
-            TenantState::Attaching => {
-                if !allow_transition_from_attaching {
-                    unreachable!("2we ensured above that we're done with activation, and, there is no re-activation")
-                };
-                *current_state = TenantState::Stopping { progress };
-                true
+            TenantState::Activating(_) | TenantState::Attaching => {
+                unreachable!("we ensured above that we're done with activation, and, there is no re-activation")
            }
            TenantState::Active => {
                // FIXME: due to time-of-check vs time-of-use issues, it can happen that new timelines
                // are created after the transition to Stopping. That's harmless, as the Timelines
                // won't be accessible to anyone afterwards, because the Tenant is in Stopping state.
-                *current_state = TenantState::Stopping { progress };
+                *current_state = TenantState::Stopping { progress: Some(progress) };
                // Continue stopping outside the closure. We need to grab timelines.lock()
                // and we plan to turn it into a tokio::sync::Mutex in a future patch.
                true
            }
+            TenantState::Stopping { progress: None } => {
+                // An attach was cancelled, and the attach transitioned the tenant from Attaching to
+                // Stopping(None) to let us know it exited. Register our progress and continue.
+                *current_state = TenantState::Stopping { progress: Some(progress) };
+                true
+            }
            TenantState::Broken { reason, .. } => {
                info!(
                    "Cannot set tenant to Stopping state, it is in Broken state due to: {reason}"
@@ -3564,7 +3566,7 @@ impl Tenant {
                err = Some(SetStoppingError::Broken);
                false
            }
-            TenantState::Stopping { progress } => {
+            TenantState::Stopping { progress: Some(progress) } => {
                info!("Tenant is already in Stopping state");
                err = Some(SetStoppingError::AlreadyStopping(progress.clone()));
                false
@@ -4065,18 +4067,19 @@ impl Tenant {

    /// Generate an up-to-date TenantManifest based on the state of this Tenant.
    fn build_tenant_manifest(&self) -> TenantManifest {
-        let timelines_offloaded = self.timelines_offloaded.lock().unwrap();
-
-        let mut timeline_manifests = timelines_offloaded
-            .iter()
-            .map(|(_timeline_id, offloaded)| offloaded.manifest())
-            .collect::<Vec<_>>();
-        // Sort the manifests so that our output is deterministic
-        timeline_manifests.sort_by_key(|timeline_manifest| timeline_manifest.timeline_id);
+        // Collect the offloaded timelines, and sort them for deterministic output.
+        let offloaded_timelines = self
+            .timelines_offloaded
+            .lock()
+            .unwrap()
+            .values()
+            .map(|tli| tli.manifest())
+            .sorted_by_key(|m| m.timeline_id)
+            .collect_vec();

        TenantManifest {
            version: LATEST_TENANT_MANIFEST_VERSION,
-            offloaded_timelines: timeline_manifests,
+            offloaded_timelines,
        }
    }

@@ -4299,7 +4302,7 @@ impl Tenant {
            timelines: Mutex::new(HashMap::new()),
            timelines_creating: Mutex::new(HashSet::new()),
            timelines_offloaded: Mutex::new(HashMap::new()),
-            tenant_manifest_upload: Default::default(),
+            remote_tenant_manifest: Default::default(),
            gc_cs: tokio::sync::Mutex::new(()),
            walredo_mgr,
            remote_storage,
@@ -5532,27 +5535,35 @@ impl Tenant {
            .unwrap_or(0)
    }

-    /// Serialize and write the latest TenantManifest to remote storage.
-    pub(crate) async fn store_tenant_manifest(&self) -> Result<(), TenantManifestError> {
-        // Only one manifest write may be done at at time, and the contents of the manifest
-        // must be loaded while holding this lock. This makes it safe to call this function
-        // from anywhere without worrying about colliding updates.
+    /// Builds a new tenant manifest, and uploads it if it differs from the last-known tenant
+    /// manifest in `Self::remote_tenant_manifest`.
+    ///
+    /// TODO: instead of requiring callers to remember to call `maybe_upload_tenant_manifest` after
+    /// changing any `Tenant` state that's included in the manifest, consider making the manifest
+    /// the authoritative source of data with an API that automatically uploads on changes. Revisit
+    /// this when the manifest is more widely used and we have a better idea of the data model.
+    pub(crate) async fn maybe_upload_tenant_manifest(&self) -> Result<(), TenantManifestError> {
+        // Multiple tasks may call this function concurrently after mutating the Tenant runtime
+        // state, affecting the manifest generated by `build_tenant_manifest`. We use an async mutex
+        // to serialize these callers. `eq_ignoring_version` acts as a slightly inefficient but
+        // simple coalescing mechanism.
        let mut guard = tokio::select! {
-            g = self.tenant_manifest_upload.lock() => {
-                g
-            },
-            _ = self.cancel.cancelled() => {
-                return Err(TenantManifestError::Cancelled);
-            }
+            guard = self.remote_tenant_manifest.lock() => guard,
+            _ = self.cancel.cancelled() => return Err(TenantManifestError::Cancelled),
        };

+        // Build a new manifest.
        let manifest = self.build_tenant_manifest();
-        if Some(&manifest) == (*guard).as_ref() {
-            // Optimisation: skip uploads that don't change anything.
-            return Ok(());
+
+        // Check if the manifest has changed. We ignore the version number here, to avoid
+        // uploading every manifest on version number bumps.
+        if let Some(old) = guard.as_ref() {
+            if manifest.eq_ignoring_version(old) {
+                return Ok(());
+            }
        }

-        // Remote storage does no retries internally, so wrap it
+        // Upload the manifest. Remote storage does no retries internally, so retry here.
        match backoff::retry(
            || async {
                upload_tenant_manifest(
@@ -5564,7 +5575,7 @@ impl Tenant {
                )
                .await
            },
-            |_e| self.cancel.is_cancelled(),
+            |_| self.cancel.is_cancelled(),
            FAILED_UPLOAD_WARN_THRESHOLD,
            FAILED_REMOTE_OP_RETRIES,
            "uploading tenant manifest",
--- a/pageserver/src/tenant/remote_timeline_client/manifest.rs
+++ b/pageserver/src/tenant/remote_timeline_client/manifest.rs
@@ -3,11 +3,15 @@ use serde::{Deserialize, Serialize};
 use utils::id::TimelineId;
 use utils::lsn::Lsn;

-/// Tenant-shard scoped manifest
-#[derive(Clone, Serialize, Deserialize, PartialEq, Eq)]
+/// Tenant shard manifest, stored in remote storage. Contains offloaded timelines and other tenant
+/// shard-wide information that must be persisted in remote storage.
+///
+/// The manifest is always updated on tenant attach, and as needed.
+#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
 pub struct TenantManifest {
-    /// Debugging aid describing the version of this manifest.
-    /// Can also be used for distinguishing breaking changes later on.
+    /// The manifest version. Incremented on manifest format changes, even non-breaking ones.
+    /// Manifests must generally always be backwards and forwards compatible for one release, to
+    /// allow release rollbacks.
    pub version: usize,

    /// The list of offloaded timelines together with enough information
@@ -16,6 +20,7 @@ pub struct TenantManifest {
    /// Note: the timelines mentioned in this list might be deleted, i.e.
    /// we don't hold an invariant that the references aren't dangling.
    /// Existence of index-part.json is the actual indicator of timeline existence.
+    #[serde(default)]
    pub offloaded_timelines: Vec<OffloadedTimelineManifest>,
 }

@@ -24,7 +29,7 @@ pub struct TenantManifest {
 /// Very similar to [`pageserver_api::models::OffloadedTimelineInfo`],
 /// but the two datastructures serve different needs, this is for a persistent disk format
 /// that must be backwards compatible, while the other is only for informative purposes.
-#[derive(Clone, Serialize, Deserialize, Copy, PartialEq, Eq)]
+#[derive(Clone, Debug, Serialize, Deserialize, Copy, PartialEq, Eq)]
 pub struct OffloadedTimelineManifest {
    pub timeline_id: TimelineId,
    /// Whether the timeline has a parent it has been branched off from or not
@@ -35,20 +40,114 @@ pub struct OffloadedTimelineManifest {
    pub archived_at: NaiveDateTime,
 }

+/// The newest manifest version. This should be incremented on changes, even non-breaking ones. We
+/// do not use deny_unknown_fields, so new fields are not breaking.
 pub const LATEST_TENANT_MANIFEST_VERSION: usize = 1;

 impl TenantManifest {
-    pub(crate) fn empty() -> Self {
-        Self {
-            version: LATEST_TENANT_MANIFEST_VERSION,
-            offloaded_timelines: vec![],
+    /// Returns true if the manifests are equal, ignoring the version number. This avoids
+    /// re-uploading all manifests just because the version number is bumped.
+    pub fn eq_ignoring_version(&self, other: &Self) -> bool {
+        // Fast path: if the version is equal, just compare directly.
+        if self.version == other.version {
+            return self == other;
        }
-    }
-    pub fn from_json_bytes(bytes: &[u8]) -> Result<Self, serde_json::Error> {
-        serde_json::from_slice::<Self>(bytes)
+
+        // We could alternatively just clone and modify the version here.
+        let Self {
+            version: _, // ignore version
+            offloaded_timelines,
+        } = self;
+
+        offloaded_timelines == &other.offloaded_timelines
    }

-    pub(crate) fn to_json_bytes(&self) -> serde_json::Result<Vec<u8>> {
+    /// Decodes a manifest from JSON.
+    pub fn from_json_bytes(bytes: &[u8]) -> Result<Self, serde_json::Error> {
+        serde_json::from_slice(bytes)
+    }
+
+    /// Encodes a manifest as JSON.
+    pub fn to_json_bytes(&self) -> serde_json::Result<Vec<u8>> {
        serde_json::to_vec(self)
    }
 }
+
+#[cfg(test)]
+mod tests {
+    use std::str::FromStr;
+
+    use utils::id::TimelineId;
+
+    use super::*;
+
+    /// Empty manifests should be parsed. Version is required.
+    #[test]
+    fn parse_empty() -> anyhow::Result<()> {
+        let json = r#"{
+             "version": 0
+         }"#;
+        let expected = TenantManifest {
+            version: 0,
+            offloaded_timelines: Vec::new(),
+        };
+        assert_eq!(expected, TenantManifest::from_json_bytes(json.as_bytes())?);
+        Ok(())
+    }
+
+    /// Unknown fields should be ignored, for forwards compatibility.
+    #[test]
+    fn parse_unknown_fields() -> anyhow::Result<()> {
+        let json = r#"{
+             "version": 1,
+             "foo": "bar"
+         }"#;
+        let expected = TenantManifest {
+            version: 1,
+            offloaded_timelines: Vec::new(),
+        };
+        assert_eq!(expected, TenantManifest::from_json_bytes(json.as_bytes())?);
+        Ok(())
+    }
+
+    /// v1 manifests should be parsed, for backwards compatibility.
+    #[test]
+    fn parse_v1() -> anyhow::Result<()> {
+        let json = r#"{
+             "version": 1,
+             "offloaded_timelines": [
+                 {
+                     "timeline_id": "5c4df612fd159e63c1b7853fe94d97da",
+                     "archived_at": "2025-03-07T11:07:11.373105434"
+                 },
+                 {
+                     "timeline_id": "f3def5823ad7080d2ea538d8e12163fa",
+                     "ancestor_timeline_id": "5c4df612fd159e63c1b7853fe94d97da",
+                     "ancestor_retain_lsn": "0/1F79038",
+                     "archived_at": "2025-03-05T11:10:22.257901390"
+                 }
+             ]
+         }"#;
+        let expected = TenantManifest {
+            version: 1,
+            offloaded_timelines: vec![
+                OffloadedTimelineManifest {
+                    timeline_id: TimelineId::from_str("5c4df612fd159e63c1b7853fe94d97da")?,
+                    ancestor_timeline_id: None,
+                    ancestor_retain_lsn: None,
+                    archived_at: NaiveDateTime::from_str("2025-03-07T11:07:11.373105434")?,
+                },
+                OffloadedTimelineManifest {
+                    timeline_id: TimelineId::from_str("f3def5823ad7080d2ea538d8e12163fa")?,
+                    ancestor_timeline_id: Some(TimelineId::from_str(
+                        "5c4df612fd159e63c1b7853fe94d97da",
+                    )?),
+                    ancestor_retain_lsn: Some(Lsn::from_str("0/1F79038")?),
+                    archived_at: NaiveDateTime::from_str("2025-03-05T11:10:22.257901390")?,
+                },
+            ],
+        };
+        assert_eq!(expected, TenantManifest::from_json_bytes(json.as_bytes())?);
+        Ok(())
+    }
+}
--- a/pageserver/src/tenant/remote_timeline_client/upload.rs
+++ b/pageserver/src/tenant/remote_timeline_client/upload.rs
@@ -61,6 +61,7 @@ pub(crate) async fn upload_index_part(
        .await
        .with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'"))
 }
+
 /// Serializes and uploads the given tenant manifest data to the remote storage.
 pub(crate) async fn upload_tenant_manifest(
    storage: &GenericRemoteStorage,
@@ -76,16 +77,14 @@ pub(crate) async fn upload_tenant_manifest(
    });
    pausable_failpoint!("before-upload-manifest-pausable");

-    let serialized = tenant_manifest.to_json_bytes()?;
-    let serialized = Bytes::from(serialized);
-
-    let tenant_manifest_site = serialized.len();
-
+    let serialized = Bytes::from(tenant_manifest.to_json_bytes()?);
+    let tenant_manifest_size = serialized.len();
    let remote_path = remote_tenant_manifest_path(tenant_shard_id, generation);
+
    storage
        .upload_storage_object(
            futures::stream::once(futures::future::ready(Ok(serialized))),
-            tenant_manifest_site,
+            tenant_manifest_size,
            &remote_path,
            cancel,
        )
--- a/pageserver/src/tenant/tasks.rs
+++ b/pageserver/src/tenant/tasks.rs
@@ -268,7 +268,12 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
                error_run += 1;
                let backoff =
                    exponential_backoff_duration(error_run, BASE_BACKOFF_SECS, MAX_BACKOFF_SECS);
-                log_compaction_error(&err, Some((error_run, backoff)), cancel.is_cancelled());
+                log_compaction_error(
+                    &err,
+                    Some((error_run, backoff)),
+                    cancel.is_cancelled(),
+                    false,
+                );
                continue;
            }
        }
@@ -285,6 +290,7 @@ pub(crate) fn log_compaction_error(
    err: &CompactionError,
    retry_info: Option<(u32, Duration)>,
    task_cancelled: bool,
+    degrade_to_warning: bool,
 ) {
    use CompactionError::*;

@@ -333,6 +339,7 @@ pub(crate) fn log_compaction_error(
        }
    } else {
        match level {
+            Level::ERROR if degrade_to_warning => warn!("Compaction failed and discarded: {err:#}"),
            Level::ERROR => error!("Compaction failed: {err:#}"),
            Level::INFO => info!("Compaction failed: {err:#}"),
            level => unimplemented!("unexpected level {level:?}"),
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -1940,7 +1940,7 @@ impl Timeline {
            )
            .await;
        if let Err(err) = &res {
-            log_compaction_error(err, None, cancel.is_cancelled());
+            log_compaction_error(err, None, cancel.is_cancelled(), false);
        }
        res
    }
@@ -6353,10 +6353,33 @@ impl Timeline {

    /// Reconstruct a value, using the given base image and WAL records in 'data'.
    async fn reconstruct_value(
+        &self,
+        key: Key,
+        request_lsn: Lsn,
+        data: ValueReconstructState,
+    ) -> Result<Bytes, PageReconstructError> {
+        self.reconstruct_value_inner(key, request_lsn, data, false)
+            .await
+    }
+
+    /// Reconstruct a value, using the given base image and WAL records in 'data'. It does not fire critical errors because
+    /// sometimes it is expected to fail due to unreplayable history described in <https://github.com/neondatabase/neon/issues/10395>.
+    async fn reconstruct_value_wo_critical_error(
+        &self,
+        key: Key,
+        request_lsn: Lsn,
+        data: ValueReconstructState,
+    ) -> Result<Bytes, PageReconstructError> {
+        self.reconstruct_value_inner(key, request_lsn, data, true)
+            .await
+    }
+
+    async fn reconstruct_value_inner(
        &self,
        key: Key,
        request_lsn: Lsn,
        mut data: ValueReconstructState,
+        no_critical_error: bool,
    ) -> Result<Bytes, PageReconstructError> {
        // Perform WAL redo if needed
        data.records.reverse();
@@ -6413,7 +6436,9 @@ impl Timeline {
                    Ok(img) => img,
                    Err(walredo::Error::Cancelled) => return Err(PageReconstructError::Cancelled),
                    Err(walredo::Error::Other(err)) => {
-                        critical!("walredo failure during page reconstruction: {err:?}");
+                        if !no_critical_error {
+                            critical!("walredo failure during page reconstruction: {err:?}");
+                        }
                        return Err(PageReconstructError::WalRedo(
                            err.context("reconstruct a page image"),
                        ));
--- a/pageserver/src/tenant/timeline/compaction.rs
+++ b/pageserver/src/tenant/timeline/compaction.rs
@@ -448,7 +448,7 @@ impl GcCompactionQueue {
    ) -> Result<CompactionOutcome, CompactionError> {
        let res = self.iteration_inner(cancel, ctx, gc_block, timeline).await;
        if let Err(err) = &res {
-            log_compaction_error(err, None, cancel.is_cancelled());
+            log_compaction_error(err, None, cancel.is_cancelled(), true);
        }
        match res {
            Ok(res) => Ok(res),
@@ -1244,6 +1244,10 @@ impl Timeline {
        let mut replace_image_layers = Vec::new();

        for layer in layers_to_rewrite {
+            if self.cancel.is_cancelled() {
+                return Err(CompactionError::ShuttingDown);
+            }
+
            tracing::info!(layer=%layer, "Rewriting layer after shard split...");
            let mut image_layer_writer = ImageLayerWriter::new(
                self.conf,
@@ -2406,7 +2410,9 @@ impl Timeline {
                } else {
                    lsn_split_points[i]
                };
-                let img = self.reconstruct_value(key, request_lsn, state).await?;
+                let img = self
+                    .reconstruct_value_wo_critical_error(key, request_lsn, state)
+                    .await?;
                Some((request_lsn, img))
            } else {
                None
@@ -3102,8 +3108,6 @@ impl Timeline {
        // the key and LSN range are determined. However, to keep things simple here, we still
        // create this writer, and discard the writer in the end.

-        let mut keys_processed = 0;
-
        while let Some(((key, lsn, val), desc)) = merge_iter
            .next_with_trace()
            .await
@@ -3114,9 +3118,7 @@ impl Timeline {
                return Err(CompactionError::ShuttingDown);
            }

-            keys_processed += 1;
            let should_yield = yield_for_l0
-                && keys_processed % 1000 == 0
                && self
                    .l0_compaction_trigger
                    .notified()
--- a/pageserver/src/tenant/timeline/delete.rs
+++ b/pageserver/src/tenant/timeline/delete.rs
@@ -410,10 +410,13 @@ impl DeleteTimelineFlow {
        // So indeed, the tenant manifest might refer to an offloaded timeline which has already been deleted.
        // However, we handle this case in tenant loading code so the next time we attach, the issue is
        // resolved.
-        tenant.store_tenant_manifest().await.map_err(|e| match e {
-            TenantManifestError::Cancelled => DeleteTimelineError::Cancelled,
-            _ => DeleteTimelineError::Other(e.into()),
-        })?;
+        tenant
+            .maybe_upload_tenant_manifest()
+            .await
+            .map_err(|err| match err {
+                TenantManifestError::Cancelled => DeleteTimelineError::Cancelled,
+                err => DeleteTimelineError::Other(err.into()),
+            })?;

        *guard = Self::Finished;

--- a/pageserver/src/tenant/timeline/offload.rs
+++ b/pageserver/src/tenant/timeline/offload.rs
@@ -111,7 +111,7 @@ pub(crate) async fn offload_timeline(
    // at the next restart attach it again.
    // For that to happen, we'd need to make the manifest reflect our *intended* state,
    // not our actual state of offloaded timelines.
-    tenant.store_tenant_manifest().await?;
+    tenant.maybe_upload_tenant_manifest().await?;

    tracing::info!("Timeline offload complete (remaining arc refcount: {remaining_refcount})");

--- a/pgxn/neon/file_cache.c
+++ b/pgxn/neon/file_cache.c
@@ -1563,8 +1563,12 @@ local_cache_pages(PG_FUNCTION_ARGS)
 				hash_seq_init(&status, lfc_hash);
 				while ((entry = hash_seq_search(&status)) != NULL)
 				{
-					for (int i = 0; i < BLOCKS_PER_CHUNK; i++)
-						n_pages += GET_STATE(entry, i) == AVAILABLE;
+					/* Skip hole tags */
+					if (NInfoGetRelNumber(BufTagGetNRelFileInfo(entry->key)) != 0)
+					{
+						for (int i = 0; i < BLOCKS_PER_CHUNK; i++)
+							n_pages += GET_STATE(entry, i) == AVAILABLE;
+					}
 				}
 			}
 		}
@@ -1592,16 +1596,19 @@ local_cache_pages(PG_FUNCTION_ARGS)
 			{
 				for (int i = 0; i < BLOCKS_PER_CHUNK; i++)
 				{
-					if (GET_STATE(entry, i) == AVAILABLE)
+					if (NInfoGetRelNumber(BufTagGetNRelFileInfo(entry->key)) != 0)
 					{
-						fctx->record[n].pageoffs = entry->offset * BLOCKS_PER_CHUNK + i;
-						fctx->record[n].relfilenode = NInfoGetRelNumber(BufTagGetNRelFileInfo(entry->key));
-						fctx->record[n].reltablespace = NInfoGetSpcOid(BufTagGetNRelFileInfo(entry->key));
-						fctx->record[n].reldatabase = NInfoGetDbOid(BufTagGetNRelFileInfo(entry->key));
-						fctx->record[n].forknum = entry->key.forkNum;
-						fctx->record[n].blocknum = entry->key.blockNum + i;
-						fctx->record[n].accesscount = entry->access_count;
-						n += 1;
+						if (GET_STATE(entry, i) == AVAILABLE)
+						{
+							fctx->record[n].pageoffs = entry->offset * BLOCKS_PER_CHUNK + i;
+							fctx->record[n].relfilenode = NInfoGetRelNumber(BufTagGetNRelFileInfo(entry->key));
+							fctx->record[n].reltablespace = NInfoGetSpcOid(BufTagGetNRelFileInfo(entry->key));
+							fctx->record[n].reldatabase = NInfoGetDbOid(BufTagGetNRelFileInfo(entry->key));
+							fctx->record[n].forknum = entry->key.forkNum;
+							fctx->record[n].blocknum = entry->key.blockNum + i;
+							fctx->record[n].accesscount = entry->access_count;
+							n += 1;
+						}
 					}
 				}
 			}
--- a/pgxn/neon/pagestore_smgr.c
+++ b/pgxn/neon/pagestore_smgr.c
@@ -1900,7 +1900,6 @@ neon_wallog_pagev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		log_pages = true;
 	}
 	else if (XLogInsertAllowed() &&
-			 !ShutdownRequestPending &&
 			 (forknum == FSM_FORKNUM || forknum == VISIBILITYMAP_FORKNUM))
 	{
 		log_pages = true;
--- a/safekeeper/src/bin/safekeeper.rs
+++ b/safekeeper/src/bin/safekeeper.rs
@@ -223,6 +223,9 @@ struct Args {
    /// Flag to use https for requests to peer's safekeeper API.
    #[arg(long)]
    pub use_https_safekeeper_api: bool,
+    /// Path to the JWT auth token used to authenticate with other safekeepers.
+    #[arg(long)]
+    auth_token_path: Option<Utf8PathBuf>,
 }

 // Like PathBufValueParser, but allows empty string.
@@ -341,14 +344,24 @@ async fn main() -> anyhow::Result<()> {
    };

    // Load JWT auth token to connect to other safekeepers for pull_timeline.
+    // First check if the env var is present, then check the arg with the path.
+    // We want to deprecate and remove the env var method in the future.
    let sk_auth_token = match var("SAFEKEEPER_AUTH_TOKEN") {
        Ok(v) => {
            info!("loaded JWT token for authentication with safekeepers");
            Some(SecretString::from(v))
        }
        Err(VarError::NotPresent) => {
-            info!("no JWT token for authentication with safekeepers detected");
-            None
+            if let Some(auth_token_path) = args.auth_token_path.as_ref() {
+                info!(
+                    "loading JWT token for authentication with safekeepers from {auth_token_path}"
+                );
+                let auth_token = tokio::fs::read_to_string(auth_token_path).await?;
+                Some(SecretString::from(auth_token.trim().to_owned()))
+            } else {
+                info!("no JWT token for authentication with safekeepers detected");
+                None
+            }
        }
        Err(_) => {
            warn!("JWT token for authentication with safekeepers is not unicode");
--- a/test_runner/regress/test_lfc_resize.py
+++ b/test_runner/regress/test_lfc_resize.py
@@ -49,6 +49,8 @@ def test_lfc_resize(neon_simple_env: NeonEnv, pg_bin: PgBin):
    conn = endpoint.connect()
    cur = conn.cursor()

+    cur.execute("create extension neon")
+
    def get_lfc_size() -> tuple[int, int]:
        lfc_file_path = endpoint.lfc_path()
        lfc_file_size = lfc_file_path.stat().st_size
@@ -103,3 +105,23 @@ def test_lfc_resize(neon_simple_env: NeonEnv, pg_bin: PgBin):
        time.sleep(1)

    assert int(lfc_file_blocks) <= 128 * 1024
+
+    # Now test that number of rows returned by local_cache is the same as file_cache_used_pages.
+    # Perform several iterations to make cache cache content stabilized.
+    nretries = 10
+    while True:
+        cur.execute("select count(*) from local_cache")
+        local_cache_size = cur.fetchall()[0][0]
+
+        cur.execute(
+            "select lfc_value::bigint FROM neon_lfc_stats where lfc_key='file_cache_used_pages'"
+        )
+        used_pages = cur.fetchall()[0][0]
+
+        if local_cache_size == used_pages or nretries == 0:
+            break
+
+        nretries = nretries - 1
+        time.sleep(1)
+
+    assert local_cache_size == used_pages
--- a/test_runner/regress/test_storage_controller.py
+++ b/test_runner/regress/test_storage_controller.py
@@ -4109,6 +4109,7 @@ def test_storcon_create_delete_sk_down(neon_env_builder: NeonEnvBuilder, restart
    env.storage_controller.allowed_errors.extend(
        [
            ".*Call to safekeeper.* management API still failed after.*",
+            ".*Call to safekeeper.* management API failed, will retry.*",
            ".*reconcile_one.*tenant_id={tenant_id}.*Call to safekeeper.* management API still failed after.*",
        ]
    )
--- a/test_runner/regress/test_timeline_archive.py
+++ b/test_runner/regress/test_timeline_archive.py
@@ -318,7 +318,7 @@ def test_timeline_offload_persist(neon_env_builder: NeonEnvBuilder, delete_timel
        neon_env_builder.pageserver_remote_storage,
        prefix=f"tenants/{str(tenant_id)}/",
    )
-    assert_prefix_empty(
+    assert_prefix_not_empty(
        neon_env_builder.pageserver_remote_storage,
        prefix=f"tenants/{str(tenant_id)}/tenant-manifest",
    )
@@ -387,7 +387,7 @@ def test_timeline_offload_persist(neon_env_builder: NeonEnvBuilder, delete_timel
            sum_again = endpoint.safe_psql("SELECT sum(key) from foo where key < 500")
            assert sum == sum_again

-        assert_prefix_empty(
+        assert_prefix_not_empty(
            neon_env_builder.pageserver_remote_storage,
            prefix=f"tenants/{str(env.initial_tenant)}/tenant-manifest",
        )
@@ -924,7 +924,7 @@ def test_timeline_offload_generations(neon_env_builder: NeonEnvBuilder):
        neon_env_builder.pageserver_remote_storage,
        prefix=f"tenants/{str(tenant_id)}/",
    )
-    assert_prefix_empty(
+    assert_prefix_not_empty(
        neon_env_builder.pageserver_remote_storage,
        prefix=f"tenants/{str(tenant_id)}/tenant-manifest",
    )
--- a/vendor/postgres-v14
+++ b/vendor/postgres-v14
--- a/vendor/postgres-v15
+++ b/vendor/postgres-v15
--- a/vendor/postgres-v16
+++ b/vendor/postgres-v16
--- a/vendor/postgres-v17
+++ b/vendor/postgres-v17
--- a/vendor/revisions.json
+++ b/vendor/revisions.json
@@ -1,18 +1,18 @@
 {
  "v17": [
    "17.4",
-    "c9e4ff5a38907acd71107634055bf2609aba43a5"
+    "66114c23bc61205b0e3fb1e77ee76a4abc1eb4b8"
  ],
  "v16": [
    "16.8",
-    "746bd9ffe5c29bce030eaea1031054057f3c5d45"
+    "d56e79cd5d6136c159b1d8d98acb7981d4b69364"
  ],
  "v15": [
    "15.12",
-    "23708b3aca9adf163aa0973eb63d9afc0e4a04c3"
+    "aeb292eeace9072e07071254b6ffc7a74007d4d2"
  ],
  "v14": [
    "14.17",
-    "8cca70c22e2894dd4645f9a940086ac437b0a11b"
+    "a0391901a2af13aa029b905272a5b2024133c926"
  ]
 }
Author	SHA1	Message	Date
Erik Grinaker	63e4032724	workflows: only run OLTP `reuse_branch` for shared_buffers testing	2025-04-08 11:36:45 +02:00
Konstantin Knizhnik	b2a0b2e9dd	Skip hole tags in local_cache view (#11454 ) ## Problem If the local file cache is shrunk, so that we punch some holes in the underlying file, the local_cache view displays the holes incorrectly. See https://github.com/neondatabase/neon/issues/10770 ## Summary of changes Skip hole tags in the local_cache view. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-08 03:52:50 +00:00
Alex Chi Z.	0875dacce0	fix(pageserver): more aggressively yield in gc-compaction, degrade errors to warnings (#11469 ) ## Problem Fix various small issues discovered during gc-compaction rollout. ## Summary of changes - Log level changes: if errors are from gc-compaction, fire a warning instead of errors or critical errors. - Yield to L0 compaction more aggressively. Instead of checking every 1k keys, we check on every key. Sometimes a single key reconstruct takes a long time. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 21:19:06 +00:00
Erik Grinaker	99d8788756	pageserver: improve tenant manifest lifecycle (#11328 ) ## Problem Currently, the tenant manifest is only uploaded if there are offloaded timelines. The checks are also a bit loose (e.g. only checks number of offloaded timelines). We want to start using the manifest for other things too (e.g. stripe size). Resolves #11271. ## Summary of changes This patch ensures that a tenant manifest always exists. The lifecycle is: * During preload, fetch the existing manifest, if any. * During attach, upload a tenant manifest if it differs from the preloaded one (or does not exist). * Upload a new manifest as needed, if it differs from the last-known manifest (ignoring version number). * On splits, pre-populate the manifest from the parent. * During Pageserver physical GC, remove old manifests but keep the latest 2 generations. This will cause nearly all existing tenants to upload a new tenant manifest on their first attach after this change. Attaches are concurrency-limited in the storage controller, so we expect this will be fine. Also updates `make_broken` to automatically log at `INFO` level when the tenant has been cancelled, to avoid spurious error logs during shutdown.	2025-04-07 19:10:36 +00:00
Erik Grinaker	26c5c7e942	pageserver: set `Stopping` state on attach cancellation (#11462 ) ## Problem If a tenant is cancelled (e.g. due to Pageserver shutdown) during attach, it is set to `Broken`. This results both in error log spam and 500 responses for clients -- shutdown is supposed to return 503 responses which can be retried. This becomes more likely to happen with #11328, where we perform tenant manifest downloads/uploads during attach. ## Summary of changes Set tenant state to `Stopping` when attach fails and the tenant is cancelled, downgrading the log messages to INFO. This introduces two variants of `Stopping` -- with and without a caller barrier -- where the latter is used to signal attach cancellation.	2025-04-07 17:56:56 +00:00
Arpad Müller	8a2b19f467	Allow potential warning in test_storcon_create_delete_sk_down (#11466 ) Since merging #11400 and addition of `test_storcon_create_delete_sk_down`, we've seen an error occur multiple times. https://github.com/neondatabase/neon/pull/11400#issuecomment-2782528369	2025-04-07 16:52:54 +00:00
Arpad Müller	486872dd28	Add support to specify auth token via --auth-token-path (#11443 ) Before we specified the JWT via `SAFEKEEPER_AUTH_TOKEN`, but env vars are quite public, both in procfs as well as the unit files. So add a way to put the auth token into a file directly. context: https://neondb.slack.com/archives/C033RQ5SPDH/p1743692566311099	2025-04-07 16:12:04 +00:00
Alex Chi Z.	d37e90f430	fix(pageserver): allow shard ancestor compaction to be cancelled (#11452 ) ## Problem https://github.com/neondatabase/neon/issues/11330 https://github.com/neondatabase/neon/issues/11358 ## Summary of changes Looking at the staging log, a few tenants right after shard split are stuck on shutdown because they are running shard ancestor compaction. The compaction does not respect the cancellation token. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-07 16:01:21 +00:00
Konstantin Knizhnik	8eb701d706	Save FSM/VM pages on normal shutdown (#11449 ) ## Problem See https://neondb.slack.com/archives/C03QLRH7PPD/p1743746717119179 We wallow FSM/VM pages when they are written to disk to persist them in PS. But it is not happen during shutdown checkpoint, because writing to WAL during checkpoint cause Postgres panic. ## Summary of changes Move `CheckPointBuffers` call to `PreCheckPointGuts` Postgres PRs: https://github.com/neondatabase/postgres/pull/615 https://github.com/neondatabase/postgres/pull/614 https://github.com/neondatabase/postgres/pull/613 https://github.com/neondatabase/postgres/pull/612 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-07 13:56:55 +00:00