From 809acb5fa93411f44091ab6a56dddeffdcee0d89 Mon Sep 17 00:00:00 2001 From: Shany Pozin Date: Tue, 21 Mar 2023 19:32:36 +0200 Subject: [PATCH 01/23] Move neon-image-depot to a larger runner (#3860) ## Describe your changes https://neondb.slack.com/archives/C039YKBRZB4/p1679413279637059 ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --- .github/workflows/build_and_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index e056cf0fcf..d50a42d83c 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -552,7 +552,7 @@ jobs: neon-image-depot: # For testing this will run side-by-side for a few merges. # This action is not really optimized yet, but gets the job done - runs-on: [ self-hosted, gen3, small ] + runs-on: [ self-hosted, gen3, large ] needs: [ tag ] container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned permissions: From 4158e24e60d294e0f039395ea95dd87f8ab317d9 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 21 Mar 2023 20:03:27 +0200 Subject: [PATCH 02/23] rfc: delete pageserver data from s3 (#3792) [Rendered](https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md) --------- Co-authored-by: Joonas Koivunen --- docs/rfcs/022-pageserver-delete-from-s3.md | 269 +++++++++++++++++++++ 1 file changed, 269 insertions(+) create mode 100644 docs/rfcs/022-pageserver-delete-from-s3.md diff --git a/docs/rfcs/022-pageserver-delete-from-s3.md b/docs/rfcs/022-pageserver-delete-from-s3.md new file mode 100644 index 0000000000..260e549670 --- /dev/null +++ b/docs/rfcs/022-pageserver-delete-from-s3.md @@ -0,0 +1,269 @@ +# Deleting pageserver part of tenants data from s3 + +Created on 08.03.23 + +## Motivation + +Currently we dont delete pageserver part of the data from s3 when project is deleted. (The same is true for safekeepers, but this outside of the scope of this RFC). + +This RFC aims to spin a discussion to come to a robust deletion solution that wont put us in into a corner for features like postponed deletion (when we keep data for user to be able to restore a project if it was deleted by accident) + +## Summary + +TLDR; There are two options, one based on control plane issuing actual delete requests to s3 and the other one that keeps s3 stuff bound to pageserver. Each one has its pros and cons. + +The decision is to stick with pageserver centric approach. For motivation see [Decision](#decision). + +## Components + +pageserver, control-plane + +## Requirements + +Deletion should successfully finish (eventually) without leaving dangling files in presense of: + +- component restarts +- component outage +- pageserver loss + +## Proposed implementation + +Before the options are discussed, note that deletion can be quite long process. For deletion from s3 the obvious choice is [DeleteObjects](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html) API call. It allows to batch deletion of up to 1k objects in one API call. So deletion operation linearly depends on number of layer files. + +Another design limitation is that there is no cheap `mv` operation available for s3. `mv` from `aws s3 mv` uses `copy(src, dst) + delete(src)`. So `mv`-like operation is not feasible as a building block because it actually amplifies the problem with both duration and resulting cost of the operation. + +The case when there are multiple pageservers handling the same tenants is largely out of scope of the RFC. We still consider case with migration from one PS to another, but do not consider case when tenant exists on multiple pageservers for extended period of time. The case with multiple pageservers can be reduced to case with one pageservers by calling detach on all pageservers except the last one, for it actual delete needs to be called. + +For simplicity lets look into deleting tenants. Differences in deletion process between tenants and timelines are mentioned in paragraph ["Differences between tenants and timelines"](#differences-between-tenants-and-timelines) + +### 1. Pageserver owns deletion machinery + +#### The sequence + +TLDR; With this approach control plane needs to call delete on a tenant and poll for progress. As much as possible is handled on pageserver. Lets see the sequence. + +Happy path: + +```mermaid +sequenceDiagram + autonumber + participant CP as Control Plane + participant PS as Pageserver + participant S3 + + CP->>PS: Delete tenant + PS->>S3: Create deleted mark file at
/tenant/meta/deleted + PS->>PS: Create deleted mark file locally + PS->>CP: Accepted + PS->>PS: delete local files other than deleted mark + loop Delete layers for each timeline + PS->>S3: delete(..) + CP->>PS: Finished? + PS->>CP: False + end + PS->>S3: Delete mark file + PS->>PS: Delete local mark file + + loop Poll for status + CP->>PS: Finished? + PS->>CP: True or False + end +``` + +Why two mark files? +Remote one is needed for cases when pageserver is lost during deletion so other pageserver can learn the deletion from s3 during attach. + +Why local mark file is needed? + +If we dont have one, we have two choices, delete local data before deleting the remote part or do that after. + +If we delete local data before remote then during restart pageserver wont pick up remote tenant at all because nothing is available locally (pageserver looks for remote conuterparts of locally available tenants). + +If we delete local data after remote then at the end of the sequence when remote mark file is deleted if pageserver restart happens then the state is the same to situation when pageserver just missing data on remote without knowing the fact that this data is intended to be deleted. In this case the current behavior is upload everything local-only to remote. + +Thus we need local record of tenant being deleted as well. + +##### Handle pageserver crashes + +Lets explore sequences with various crash points. + +Pageserver crashes before `deleted` mark file is persisted in s3: + +```mermaid +sequenceDiagram + autonumber + participant CP as Control Plane + participant PS as Pageserver + participant S3 + + CP->>PS: Delete tenant + note over PS: Crash point 1. + CP->>PS: Retry delete request + + PS->>S3: Create deleted mark file at
/tenant/meta/deleted + PS->>PS: Create deleted mark file locally + + PS->>CP: Accepted + + PS->>PS: delete local files other than deleted mark + + loop Delete layers for each timeline + PS->>S3: delete(..) + CP->>PS: Finished? + PS->>CP: False + end + PS->>S3: Delete mark file + PS->>PS: Delete local mark file + + CP->>PS: Finished? + PS->>CP: True +``` + +Pageserver crashed when deleted mark was about to be persisted in s3, before Control Plane gets a response: + +```mermaid +sequenceDiagram + autonumber + participant CP as Control Plane + participant PS as Pageserver + participant S3 + + CP->>PS: Delete tenant + PS->>S3: Create deleted mark file at
/tenant/meta/deleted + + note over PS: Crash point 2. + note over PS: During startup we reconcile
with remote and see
whether the remote mark exists + alt Remote mark exists + PS->>PS: create local mark if its missing + PS->>PS: delete local files other than deleted mark + loop Delete layers for each timeline + PS->>S3: delete(..) + end + + note over CP: Eventually console should
retry delete request + + CP->>PS: Retry delete tenant + PS->>CP: Not modified + else Mark is missing + note over PS: Continue to operate the tenant as if deletion didnt happen + + note over CP: Eventually console should
retry delete request + + CP->>PS: Retry delete tenant + PS->>S3: Create deleted mark file at
/tenant/meta/deleted + PS->>CP: Delete tenant + end + + PS->>PS: Continue with layer file deletions + loop Delete layers for each timeline + PS->>S3: delete(..) + CP->>PS: Finished? + PS->>CP: False + end + + PS->>S3: Delete mark file + PS->>PS: Delete local mark file + + CP->>PS: Finished? + PS->>CP: True +``` + +Similar sequence applies when both local and remote marks were persisted but Control Plane still didnt receive a response. + +If pageserver crashes after both mark files were deleted then it will reply to control plane status poll request with 404 which should be treated by control plane as success. + +The same applies if pageserver crashes in the end, when remote mark is deleted but before local one gets deleted. In this case on restart pageserver moves forward with deletion of local mark and Control Plane will receive 404. + +##### Differences between tenants and timelines + +For timeline the sequence is the same with the following differences: + +- remote delete mark file can be replaced with a boolean "deleted" flag in index_part.json +- local deletion mark is not needed, because whole tenant is kept locally so situation described in motivation for local mark is impossible + +##### Handle pageserver loss + +If pageseserver is lost then the deleted tenant should be attached to different pageserver and delete request needs to be retried against new pageserver. Then attach logic is shared with one described for pageserver restarts (local deletion mark wont be available so needs to be created). + +##### Restrictions for tenant that is in progress of being deleted + +I propose to add another state to tenant/timeline - PendingDelete. This state shouldnt allow executing any operations aside from polling the deletion status. + +#### Summary + +Pros: + +- Storage is not dependent on control plane. Storage can be restarted even if control plane is not working. +- Allows for easier dogfooding, console can use Neon backed database as primary operational data store. If storage depends on control plane and control plane depends on storage we're stuck. +- No need to share inner s3 workings with control plane. Pageserver presents api contract and S3 paths are not part of this contract. +- No need to pass list of alive timelines to attach call. This will be solved by pageserver observing deleted flag. See + +Cons: + +- Logic is a tricky, needs good testing +- Anything else? + +### 2. Control plane owns deletion machinery + +In this case the only action performed on pageserver is removal of local files. + +Everything else is done by control plane. The steps are as follows: + +1. Control plane marks tenant as "delete pending" in its database +2. It lists the s3 for all the files and repeatedly calls delete until nothing is left behind +3. When no files are left marks deletion as completed + +In case of restart it selects all tenants marked as "delete pending" and continues the deletion. + +For tenants it is simple. For timelines there are caveats. + +Assume that the same workflow is used for timelines. + +If a tenant gets relocated during timeline deletion the attach call with its current logic will pick up deleted timeline in its half deleted state. + +Available options: + +- require list of alive timelines to be passed to attach call +- use the same schema with flag in index_part.json (again part of the caveats around pageserver restart applies). In this case nothing stops pageserver from implementing deletion inside if we already have these deletion marks. + +With first option the following problem becomes apparent: + +Who is the source of truth regarding timeline liveness? + +Imagine: +PS1 fails. +PS2 gets assigned the tenant. +New branch gets created +PS1 starts up (is it possible or we just recycle it?) +PS1 is unaware of the new branch. It can either fall back to s3 ls, or ask control plane. + +So here comes the dependency of storage on control plane. During restart storage needs to know which timelines are valid for operation. If there is nothing on s3 that can answer that question storage neeeds to ask control plane. + +### Summary + +Cons: + +- Potential thundering herd-like problem during storage restart (requests to control plane) +- Potential increase in storage startup time (additional request to control plane) +- Storage startup starts to depend on console +- Erroneous attach call can attach tenant in half deleted state + +Pros: + +- Easier to reason about if you dont have to account for pageserver restarts + +### Extra notes + +There was a concern that having deletion code in pageserver is a littlebit scary, but we need to have this code somewhere. So to me it is equally scary to have that in whatever place it ends up at. + +Delayed deletion can be done with both approaches. As discussed with Anna (@stepashka) this is only relevant for tenants (projects) not for timelines. For first approach detach can be called immediately and deletion can be done later with attach + delete. With second approach control plane needs to start the deletion whenever necessary. + +## Decision + +After discussion in comments I see that we settled on two options (though a bit different from ones described in rfc). First one is the same - pageserver owns as much as possible. The second option is that pageserver owns markers thing, but actual deletion happens in control plane by repeatedly calling ls + delete. + +To my mind the only benefit of the latter approach is possible code reuse between safekeepers and pageservers. Otherwise poking around integrating s3 library into control plane, configuring shared knowledge abouth paths in s3 - are the downsides. Another downside of relying on control plane is the testing process. Control plane resides in different repository so it is quite hard to test pageserver related changes there. e2e test suite there doesnt support shutting down pageservers, which are separate docker containers there instead of just processes. + +With pageserver owning everything we still give the retry logic to control plane but its easier to duplicate if needed compared to sharing inner s3 workings. We will have needed tests for retry logic in neon repo. + +So the decision is to proceed with pageserver centric approach. From 6fdd9c10d18270a5e30704f17e573ea14ee978ce Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 21 Mar 2023 15:24:49 +0200 Subject: [PATCH 03/23] Read storage auth token from spec file. We read the pageserver connection string from the spec file, so let's read the auth token from the same place. We've been talking about pre-launching compute nodes that are not associated with any particular tenant at startup, so that the spec file is delivered to the compute node later. We cannot change the env variables after the process has been launched. We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN env variable, but compute_ctl is now responsible for setting it. --- compute_tools/src/bin/compute_ctl.rs | 2 ++ compute_tools/src/compute.rs | 29 +++++++++++++++++----------- compute_tools/src/spec.rs | 2 ++ 3 files changed, 22 insertions(+), 11 deletions(-) diff --git a/compute_tools/src/bin/compute_ctl.rs b/compute_tools/src/bin/compute_ctl.rs index a4e9262072..b96842e416 100644 --- a/compute_tools/src/bin/compute_ctl.rs +++ b/compute_tools/src/bin/compute_ctl.rs @@ -133,6 +133,7 @@ fn main() -> Result<()> { .settings .find("neon.pageserver_connstring") .expect("pageserver connstr should be provided"); + let storage_auth_token = spec.storage_auth_token.clone(); let tenant = spec .cluster .settings @@ -153,6 +154,7 @@ fn main() -> Result<()> { tenant, timeline, pageserver_connstr, + storage_auth_token, metrics: ComputeMetrics::default(), state: RwLock::new(ComputeState::new()), }; diff --git a/compute_tools/src/compute.rs b/compute_tools/src/compute.rs index 09272262de..00d1e234ab 100644 --- a/compute_tools/src/compute.rs +++ b/compute_tools/src/compute.rs @@ -45,6 +45,7 @@ pub struct ComputeNode { pub tenant: String, pub timeline: String, pub pageserver_connstr: String, + pub storage_auth_token: Option, pub metrics: ComputeMetrics, /// Volatile part of the `ComputeNode` so should be used under `RwLock` /// to allow HTTP API server to serve status requests, while configuration @@ -129,18 +130,14 @@ impl ComputeNode { let mut config = postgres::Config::from_str(&self.pageserver_connstr)?; - // Like in the neon extension, if the $NEON_AUTH_TOKEN env variable is - // set, use it as the password when connecting to pageserver. - // + // Use the storage auth token from the config file, if given. // Note: this overrides any password set in the connection string. - match std::env::var("NEON_AUTH_TOKEN") { - Ok(val) => { - info!("Got pageserver auth token from NEON_AUTH_TOKEN env variable"); - config.password(val); - } - Err(std::env::VarError::NotPresent) => info!("NEON_AUTH_TOKEN env variable not set"), - Err(e) => info!("could not parse NEON_AUTH_TOKEN env variable: {}", e), - }; + if let Some(storage_auth_token) = &self.storage_auth_token { + info!("Got storage auth token from spec file"); + config.password(storage_auth_token); + } else { + info!("Storage auth token not set"); + } let mut client = config.connect(NoTls)?; let basebackup_cmd = match lsn { @@ -179,6 +176,11 @@ impl ComputeNode { let sync_handle = Command::new(&self.pgbin) .args(["--sync-safekeepers"]) .env("PGDATA", &self.pgdata) // we cannot use -D in this mode + .envs(if let Some(storage_auth_token) = &self.storage_auth_token { + vec![("NEON_AUTH_TOKEN", storage_auth_token)] + } else { + vec![] + }) .stdout(Stdio::piped()) .spawn() .expect("postgres --sync-safekeepers failed to start"); @@ -256,6 +258,11 @@ impl ComputeNode { // Run postgres as a child process. let mut pg = Command::new(&self.pgbin) .args(["-D", &self.pgdata]) + .envs(if let Some(storage_auth_token) = &self.storage_auth_token { + vec![("NEON_AUTH_TOKEN", storage_auth_token)] + } else { + vec![] + }) .spawn() .expect("cannot start postgres process"); diff --git a/compute_tools/src/spec.rs b/compute_tools/src/spec.rs index 47f1d69cff..9694ba9a88 100644 --- a/compute_tools/src/spec.rs +++ b/compute_tools/src/spec.rs @@ -24,6 +24,8 @@ pub struct ComputeSpec { pub cluster: Cluster, pub delta_operations: Option>, + pub storage_auth_token: Option, + pub startup_tracing_context: Option>, } From dd22c871003275d1087a9a5a4948f030ad6a8eda Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 21 Mar 2023 23:33:28 +0200 Subject: [PATCH 04/23] Remove older layer metadata format support code (#3854) The PR enforces current newest `index_part.json` format in the type system (version `1`), not allowing any previous forms of it, that were used in the past. Similarly, the code to mitigate the https://github.com/neondatabase/neon/issues/3024 issue is now also removed. Current code does not produce old formats and extra files in the index_part.json, in the future we will be able to use https://github.com/neondatabase/aversion or other approach to make version transitions more explicit. See https://neondb.slack.com/archives/C033RQ5SPDH/p1679134185248119 for the justification on the breaking changes. --- libs/pageserver_api/src/models.rs | 4 +- pageserver/src/http/routes.rs | 4 +- .../src/tenant/remote_timeline_client.rs | 38 +-- .../tenant/remote_timeline_client/download.rs | 21 +- .../tenant/remote_timeline_client/index.rs | 219 +++++------------- .../tenant/remote_timeline_client/upload.rs | 10 +- pageserver/src/tenant/storage_layer.rs | 2 +- .../src/tenant/storage_layer/delta_layer.rs | 6 +- .../src/tenant/storage_layer/filename.rs | 9 + .../src/tenant/storage_layer/image_layer.rs | 6 +- .../src/tenant/storage_layer/remote_layer.rs | 2 +- pageserver/src/tenant/timeline.rs | 88 +++---- pageserver/src/tenant/upload_queue.rs | 15 +- .../test_tenants_with_remote_storage.py | 200 ---------------- 14 files changed, 132 insertions(+), 492 deletions(-) diff --git a/libs/pageserver_api/src/models.rs b/libs/pageserver_api/src/models.rs index 7a43100ba5..0f860d0a6d 100644 --- a/libs/pageserver_api/src/models.rs +++ b/libs/pageserver_api/src/models.rs @@ -346,7 +346,7 @@ pub enum InMemoryLayerInfo { pub enum HistoricLayerInfo { Delta { layer_file_name: String, - layer_file_size: Option, + layer_file_size: u64, #[serde_as(as = "DisplayFromStr")] lsn_start: Lsn, @@ -357,7 +357,7 @@ pub enum HistoricLayerInfo { }, Image { layer_file_name: String, - layer_file_size: Option, + layer_file_size: u64, #[serde_as(as = "DisplayFromStr")] lsn_start: Lsn, diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 39f2776952..d91e421a52 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -185,7 +185,7 @@ fn build_timeline_info_common( None } }; - let current_physical_size = Some(timeline.layer_size_sum().approximate_is_ok()); + let current_physical_size = Some(timeline.layer_size_sum()); let state = timeline.current_state(); let remote_consistent_lsn = timeline.get_remote_consistent_lsn().unwrap_or(Lsn(0)); @@ -451,7 +451,7 @@ async fn tenant_status(request: Request) -> Result, ApiErro // Calculate total physical size of all timelines let mut current_physical_size = 0; for timeline in tenant.list_timelines().iter() { - current_physical_size += timeline.layer_size_sum().approximate_is_ok(); + current_physical_size += timeline.layer_size_sum(); } let state = tenant.current_state(); diff --git a/pageserver/src/tenant/remote_timeline_client.rs b/pageserver/src/tenant/remote_timeline_client.rs index f3943298f2..28c4943dbd 100644 --- a/pageserver/src/tenant/remote_timeline_client.rs +++ b/pageserver/src/tenant/remote_timeline_client.rs @@ -210,7 +210,6 @@ pub use download::{is_temp_download_file, list_remote_timelines}; use std::sync::atomic::{AtomicU32, Ordering}; use std::sync::{Arc, Mutex}; -use anyhow::ensure; use remote_storage::{DownloadError, GenericRemoteStorage}; use std::ops::DerefMut; use tokio::runtime::Runtime; @@ -347,7 +346,7 @@ impl RemoteTimelineClient { .layer_metadata .values() // If we don't have the file size for the layer, don't account for it in the metric. - .map(|ilmd| ilmd.file_size.unwrap_or(0)) + .map(|ilmd| ilmd.file_size) .sum() } else { 0 @@ -420,34 +419,6 @@ impl RemoteTimelineClient { .await? }; - // Update the metadata for given layer file. The remote index file - // might be missing some information for the file; this allows us - // to fill in the missing details. - if layer_metadata.file_size().is_none() { - let new_metadata = LayerFileMetadata::new(downloaded_size); - let mut guard = self.upload_queue.lock().unwrap(); - let upload_queue = guard.initialized_mut()?; - if let Some(upgraded) = upload_queue.latest_files.get_mut(layer_file_name) { - if upgraded.merge(&new_metadata) { - upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1; - } - // If we don't do an index file upload inbetween here and restart, - // the value will go back down after pageserver restart, since we will - // have lost this data point. - // But, we upload index part fairly frequently, and restart pageserver rarely. - // So, by accounting eagerly, we present a most-of-the-time-more-accurate value sooner. - self.metrics - .remote_physical_size_gauge() - .add(downloaded_size); - } else { - // The file should exist, since we just downloaded it. - warn!( - "downloaded file {:?} not found in local copy of the index file", - layer_file_name - ); - } - } - REMOTE_ONDEMAND_DOWNLOADED_LAYERS.inc(); REMOTE_ONDEMAND_DOWNLOADED_BYTES.inc_by(downloaded_size); @@ -550,13 +521,6 @@ impl RemoteTimelineClient { let mut guard = self.upload_queue.lock().unwrap(); let upload_queue = guard.initialized_mut()?; - // The file size can be missing for files that were created before we tracked that - // in the metadata, but it should be present for any new files we create. - ensure!( - layer_metadata.file_size().is_some(), - "file size not initialized in metadata" - ); - upload_queue .latest_files .insert(layer_file_name.clone(), layer_metadata.clone()); diff --git a/pageserver/src/tenant/remote_timeline_client/download.rs b/pageserver/src/tenant/remote_timeline_client/download.rs index ea8d9858c3..bda095d850 100644 --- a/pageserver/src/tenant/remote_timeline_client/download.rs +++ b/pageserver/src/tenant/remote_timeline_client/download.rs @@ -21,7 +21,7 @@ use remote_storage::{DownloadError, GenericRemoteStorage}; use utils::crashsafe::path_with_suffix_extension; use utils::id::{TenantId, TimelineId}; -use super::index::{IndexPart, IndexPartUnclean, LayerFileMetadata}; +use super::index::{IndexPart, LayerFileMetadata}; use super::{FAILED_DOWNLOAD_RETRIES, FAILED_DOWNLOAD_WARN_THRESHOLD}; async fn fsync_path(path: impl AsRef) -> Result<(), std::io::Error> { @@ -113,16 +113,11 @@ pub async fn download_layer_file<'a>( }) .map_err(DownloadError::Other)?; - match layer_metadata.file_size() { - Some(expected) if expected != bytes_amount => { - return Err(DownloadError::Other(anyhow!( - "According to layer file metadata should have downloaded {expected} bytes but downloaded {bytes_amount} bytes into file '{}'", - temp_file_path.display() - ))); - } - Some(_) | None => { - // matches, or upgrading from an earlier IndexPart version - } + let expected = layer_metadata.file_size(); + if expected != bytes_amount { + return Err(DownloadError::Other(anyhow!( + "According to layer file metadata should have downloaded {expected} bytes but downloaded {bytes_amount} bytes into file {temp_file_path:?}", + ))); } // not using sync_data because it can lose file size update @@ -261,14 +256,12 @@ pub(super) async fn download_index_part( ) .await?; - let index_part: IndexPartUnclean = serde_json::from_slice(&index_part_bytes) + let index_part: IndexPart = serde_json::from_slice(&index_part_bytes) .with_context(|| { format!("Failed to deserialize index part file into file {index_part_path:?}") }) .map_err(DownloadError::Other)?; - let index_part = index_part.remove_unclean_layer_file_names(); - Ok(index_part) } diff --git a/pageserver/src/tenant/remote_timeline_client/index.rs b/pageserver/src/tenant/remote_timeline_client/index.rs index 420edae6cd..9c84f8e977 100644 --- a/pageserver/src/tenant/remote_timeline_client/index.rs +++ b/pageserver/src/tenant/remote_timeline_client/index.rs @@ -6,7 +6,6 @@ use std::collections::{HashMap, HashSet}; use serde::{Deserialize, Serialize}; use serde_with::{serde_as, DisplayFromStr}; -use tracing::warn; use crate::tenant::metadata::TimelineMetadata; use crate::tenant::storage_layer::LayerFileName; @@ -20,7 +19,7 @@ use utils::lsn::Lsn; #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)] #[cfg_attr(test, derive(Default))] pub struct LayerFileMetadata { - file_size: Option, + file_size: u64, } impl From<&'_ IndexLayerMetadata> for LayerFileMetadata { @@ -33,36 +32,16 @@ impl From<&'_ IndexLayerMetadata> for LayerFileMetadata { impl LayerFileMetadata { pub fn new(file_size: u64) -> Self { - LayerFileMetadata { - file_size: Some(file_size), - } + LayerFileMetadata { file_size } } - /// This is used to initialize the metadata for remote layers, for which - /// the metadata was missing from the index part file. - pub const MISSING: Self = LayerFileMetadata { file_size: None }; - - pub fn file_size(&self) -> Option { + pub fn file_size(&self) -> u64 { self.file_size } - - /// Metadata has holes due to version upgrades. This method is called to upgrade self with the - /// other value. - /// - /// This is called on the possibly outdated version. Returns true if any changes - /// were made. - pub fn merge(&mut self, other: &Self) -> bool { - let mut changed = false; - - if self.file_size != other.file_size { - self.file_size = other.file_size.or(self.file_size); - changed = true; - } - - changed - } } +// TODO seems like another part of the remote storage file format +// compatibility issue, see https://github.com/neondatabase/neon/issues/3072 /// In-memory representation of an `index_part.json` file /// /// Contains the data about all files in the timeline, present remotely and its metadata. @@ -71,10 +50,7 @@ impl LayerFileMetadata { /// remember to add a test case for the changed version. #[serde_as] #[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)] -pub struct IndexPartImpl -where - L: std::hash::Hash + PartialEq + Eq, -{ +pub struct IndexPart { /// Debugging aid describing the version of this type. #[serde(default)] version: usize, @@ -82,14 +58,13 @@ where /// Layer names, which are stored on the remote storage. /// /// Additional metadata can might exist in `layer_metadata`. - pub timeline_layers: HashSet, + pub timeline_layers: HashSet, /// Per layer file name metadata, which can be present for a present or missing layer file. /// /// Older versions of `IndexPart` will not have this property or have only a part of metadata /// that latest version stores. - #[serde(default = "HashMap::default")] - pub layer_metadata: HashMap, + pub layer_metadata: HashMap, // 'disk_consistent_lsn' is a copy of the 'disk_consistent_lsn' in the metadata. // It's duplicated here for convenience. @@ -98,101 +73,6 @@ where metadata_bytes: Vec, } -// TODO seems like another part of the remote storage file format -// compatibility issue, see https://github.com/neondatabase/neon/issues/3072 -pub type IndexPart = IndexPartImpl; - -pub type IndexPartUnclean = IndexPartImpl; - -#[derive(Debug, PartialEq, Eq, Hash, Clone)] -pub enum UncleanLayerFileName { - Clean(LayerFileName), - BackupFile(String), -} - -impl<'de> serde::Deserialize<'de> for UncleanLayerFileName { - fn deserialize(deserializer: D) -> Result - where - D: serde::Deserializer<'de>, - { - deserializer.deserialize_string(UncleanLayerFileNameVisitor) - } -} - -struct UncleanLayerFileNameVisitor; - -impl<'de> serde::de::Visitor<'de> for UncleanLayerFileNameVisitor { - type Value = UncleanLayerFileName; - - fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result { - write!( - formatter, - "a string that is a valid LayerFileName or '.old' backup file name" - ) - } - - fn visit_str(self, v: &str) -> Result - where - E: serde::de::Error, - { - let maybe_clean: Result = v.parse(); - match maybe_clean { - Ok(clean) => Ok(UncleanLayerFileName::Clean(clean)), - Err(e) => { - if v.ends_with(".old") || v == "metadata_backup" { - Ok(UncleanLayerFileName::BackupFile(v.to_owned())) - } else { - Err(E::custom(e)) - } - } - } - } -} - -impl UncleanLayerFileName { - fn into_clean(self) -> Option { - match self { - UncleanLayerFileName::Clean(clean) => Some(clean), - UncleanLayerFileName::BackupFile(_) => None, - } - } -} - -impl IndexPartUnclean { - pub fn remove_unclean_layer_file_names(self) -> IndexPart { - let IndexPartUnclean { - version, - timeline_layers, - layer_metadata, - disk_consistent_lsn, - metadata_bytes, - } = self; - - IndexPart { - version, - timeline_layers: timeline_layers - .into_iter() - .filter_map(|unclean_file_name| match unclean_file_name { - UncleanLayerFileName::Clean(clean_name) => Some(clean_name), - UncleanLayerFileName::BackupFile(backup_file_name) => { - // For details see https://github.com/neondatabase/neon/issues/3024 - warn!( - "got backup file on the remote storage, ignoring it {backup_file_name}" - ); - None - } - }) - .collect(), - layer_metadata: layer_metadata - .into_iter() - .filter_map(|(l, m)| l.into_clean().map(|l| (l, m))) - .collect(), - disk_consistent_lsn, - metadata_bytes, - } - } -} - impl IndexPart { /// When adding or modifying any parts of `IndexPart`, increment the version so that it can be /// used to understand later versions. @@ -232,7 +112,7 @@ impl IndexPart { /// Serialized form of [`LayerFileMetadata`]. #[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize, Default)] pub struct IndexLayerMetadata { - pub(super) file_size: Option, + pub(super) file_size: u64, } impl From<&'_ LayerFileMetadata> for IndexLayerMetadata { @@ -247,27 +127,6 @@ impl From<&'_ LayerFileMetadata> for IndexLayerMetadata { mod tests { use super::*; - #[test] - fn v0_indexpart_is_parsed() { - let example = r#"{ - "timeline_layers":["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9"], - "disk_consistent_lsn":"0/16960E8", - "metadata_bytes":[113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] - }"#; - - let expected = IndexPart { - version: 0, - timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]), - layer_metadata: HashMap::default(), - disk_consistent_lsn: "0/16960E8".parse::().unwrap(), - metadata_bytes: [113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(), - }; - - let part: IndexPartUnclean = serde_json::from_str(example).unwrap(); - let part = part.remove_unclean_layer_file_names(); - assert_eq!(part, expected); - } - #[test] fn v1_indexpart_is_parsed() { let example = r#"{ @@ -287,21 +146,19 @@ mod tests { timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]), layer_metadata: HashMap::from([ ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata { - file_size: Some(25600000), + file_size: 25600000, }), ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata { // serde_json should always parse this but this might be a double with jq for // example. - file_size: Some(9007199254741001), + file_size: 9007199254741001, }) ]), disk_consistent_lsn: "0/16960E8".parse::().unwrap(), metadata_bytes: [113,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(), }; - let part = serde_json::from_str::(example) - .unwrap() - .remove_unclean_layer_file_names(); + let part = serde_json::from_str::(example).unwrap(); assert_eq!(part, expected); } @@ -325,20 +182,64 @@ mod tests { timeline_layers: HashSet::from(["000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap()]), layer_metadata: HashMap::from([ ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata { - file_size: Some(25600000), + file_size: 25600000, }), ("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata { // serde_json should always parse this but this might be a double with jq for // example. - file_size: Some(9007199254741001), + file_size: 9007199254741001, }) ]), disk_consistent_lsn: "0/16960E8".parse::().unwrap(), metadata_bytes: [112,11,159,210,0,54,0,4,0,0,0,0,1,105,96,232,1,0,0,0,0,1,105,96,112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,105,96,112,0,0,0,0,1,105,96,112,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0].to_vec(), }; - let part = serde_json::from_str::(example).unwrap(); - let part = part.remove_unclean_layer_file_names(); + let part = serde_json::from_str::(example).unwrap(); assert_eq!(part, expected); } + + #[test] + fn empty_layers_are_parsed() { + let empty_layers_json = r#"{ + "version":1, + "timeline_layers":[], + "layer_metadata":{}, + "disk_consistent_lsn":"0/2532648", + "metadata_bytes":[136,151,49,208,0,70,0,4,0,0,0,0,2,83,38,72,1,0,0,0,0,2,83,38,32,1,87,198,240,135,97,119,45,125,38,29,155,161,140,141,255,210,0,0,0,0,2,83,38,72,0,0,0,0,1,73,240,192,0,0,0,0,1,73,240,192,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] + }"#; + + let expected = IndexPart { + version: 1, + timeline_layers: HashSet::new(), + layer_metadata: HashMap::new(), + disk_consistent_lsn: "0/2532648".parse::().unwrap(), + metadata_bytes: [ + 136, 151, 49, 208, 0, 70, 0, 4, 0, 0, 0, 0, 2, 83, 38, 72, 1, 0, 0, 0, 0, 2, 83, + 38, 32, 1, 87, 198, 240, 135, 97, 119, 45, 125, 38, 29, 155, 161, 140, 141, 255, + 210, 0, 0, 0, 0, 2, 83, 38, 72, 0, 0, 0, 0, 1, 73, 240, 192, 0, 0, 0, 0, 1, 73, + 240, 192, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, + ] + .to_vec(), + }; + + let empty_layers_parsed = serde_json::from_str::(empty_layers_json).unwrap(); + + assert_eq!(empty_layers_parsed, expected); + } } diff --git a/pageserver/src/tenant/remote_timeline_client/upload.rs b/pageserver/src/tenant/remote_timeline_client/upload.rs index 5082fa1634..ce9f4d9bf8 100644 --- a/pageserver/src/tenant/remote_timeline_client/upload.rs +++ b/pageserver/src/tenant/remote_timeline_client/upload.rs @@ -64,13 +64,9 @@ pub(super) async fn upload_timeline_layer<'a>( })? .len(); - // FIXME: this looks bad - if let Some(metadata_size) = known_metadata.file_size() { - if metadata_size != fs_size { - bail!("File {source_path:?} has its current FS size {fs_size} diferent from initially determined {metadata_size}"); - } - } else { - // this is a silly state we would like to avoid + let metadata_size = known_metadata.file_size(); + if metadata_size != fs_size { + bail!("File {source_path:?} has its current FS size {fs_size} diferent from initially determined {metadata_size}"); } let fs_size = usize::try_from(fs_size).with_context(|| { diff --git a/pageserver/src/tenant/storage_layer.rs b/pageserver/src/tenant/storage_layer.rs index 52ce2cab42..c36b6121c0 100644 --- a/pageserver/src/tenant/storage_layer.rs +++ b/pageserver/src/tenant/storage_layer.rs @@ -378,7 +378,7 @@ pub trait PersistentLayer: Layer { /// /// Should not change over the lifetime of the layer object because /// current_physical_size is computed as the som of this value. - fn file_size(&self) -> Option; + fn file_size(&self) -> u64; fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo; diff --git a/pageserver/src/tenant/storage_layer/delta_layer.rs b/pageserver/src/tenant/storage_layer/delta_layer.rs index 37719dfce5..98cbcc5f07 100644 --- a/pageserver/src/tenant/storage_layer/delta_layer.rs +++ b/pageserver/src/tenant/storage_layer/delta_layer.rs @@ -444,8 +444,8 @@ impl PersistentLayer for DeltaLayer { Ok(()) } - fn file_size(&self) -> Option { - Some(self.file_size) + fn file_size(&self) -> u64 { + self.file_size } fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo { @@ -456,7 +456,7 @@ impl PersistentLayer for DeltaLayer { HistoricLayerInfo::Delta { layer_file_name, - layer_file_size: Some(self.file_size), + layer_file_size: self.file_size, lsn_start: lsn_range.start, lsn_end: lsn_range.end, remote: false, diff --git a/pageserver/src/tenant/storage_layer/filename.rs b/pageserver/src/tenant/storage_layer/filename.rs index efd0769886..e2112fc388 100644 --- a/pageserver/src/tenant/storage_layer/filename.rs +++ b/pageserver/src/tenant/storage_layer/filename.rs @@ -258,6 +258,15 @@ impl serde::Serialize for LayerFileName { } } +impl<'de> serde::Deserialize<'de> for LayerFileName { + fn deserialize(deserializer: D) -> Result + where + D: serde::Deserializer<'de>, + { + deserializer.deserialize_string(LayerFileNameVisitor) + } +} + struct LayerFileNameVisitor; impl<'de> serde::de::Visitor<'de> for LayerFileNameVisitor { diff --git a/pageserver/src/tenant/storage_layer/image_layer.rs b/pageserver/src/tenant/storage_layer/image_layer.rs index e37e001eda..a99b1b491f 100644 --- a/pageserver/src/tenant/storage_layer/image_layer.rs +++ b/pageserver/src/tenant/storage_layer/image_layer.rs @@ -258,8 +258,8 @@ impl PersistentLayer for ImageLayer { Ok(()) } - fn file_size(&self) -> Option { - Some(self.file_size) + fn file_size(&self) -> u64 { + self.file_size } fn info(&self, reset: LayerAccessStatsReset) -> HistoricLayerInfo { @@ -268,7 +268,7 @@ impl PersistentLayer for ImageLayer { HistoricLayerInfo::Image { layer_file_name, - layer_file_size: Some(self.file_size), + layer_file_size: self.file_size, lsn_start: lsn_range.start, remote: false, access_stats: self.access_stats.as_api_model(reset), diff --git a/pageserver/src/tenant/storage_layer/remote_layer.rs b/pageserver/src/tenant/storage_layer/remote_layer.rs index dbce2e7888..2eb7eb0cb6 100644 --- a/pageserver/src/tenant/storage_layer/remote_layer.rs +++ b/pageserver/src/tenant/storage_layer/remote_layer.rs @@ -167,7 +167,7 @@ impl PersistentLayer for RemoteLayer { true } - fn file_size(&self) -> Option { + fn file_size(&self) -> u64 { self.layer_metadata.file_size() } diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs index f5dbe63b0b..4d03a78883 100644 --- a/pageserver/src/tenant/timeline.rs +++ b/pageserver/src/tenant/timeline.rs @@ -334,25 +334,6 @@ impl LogicalSize { } } -/// Returned by [`Timeline::layer_size_sum`] -pub enum LayerSizeSum { - /// The result is accurate. - Accurate(u64), - // We don't know the layer file size of one or more layers. - // They contribute to the sum with a value of 0. - // Hence, the sum is a lower bound for the actualy layer file size sum. - ApproximateLowerBound(u64), -} - -impl LayerSizeSum { - pub fn approximate_is_ok(self) -> u64 { - match self { - LayerSizeSum::Accurate(v) => v, - LayerSizeSum::ApproximateLowerBound(v) => v, - } - } -} - pub struct WalReceiverInfo { pub wal_source_connconf: PgConnectionConfig, pub last_received_msg_lsn: Lsn, @@ -550,20 +531,13 @@ impl Timeline { /// The sum of the file size of all historic layers in the layer map. /// This method makes no distinction between local and remote layers. /// Hence, the result **does not represent local filesystem usage**. - pub fn layer_size_sum(&self) -> LayerSizeSum { + pub fn layer_size_sum(&self) -> u64 { let layer_map = self.layers.read().unwrap(); let mut size = 0; - let mut no_size_cnt = 0; for l in layer_map.iter_historic_layers() { - let (l_size, l_no_size) = l.file_size().map(|s| (s, 0)).unwrap_or((0, 1)); - size += l_size; - no_size_cnt += l_no_size; - } - if no_size_cnt == 0 { - LayerSizeSum::Accurate(size) - } else { - LayerSizeSum::ApproximateLowerBound(size) + size += l.file_size(); } + size } pub fn get_resident_physical_size(&self) -> u64 { @@ -1047,9 +1021,7 @@ impl Timeline { return Ok(false); } - let layer_file_size = local_layer - .file_size() - .expect("Local layer should have a file size"); + let layer_file_size = local_layer.file_size(); let local_layer_mtime = local_layer .local_path() @@ -1514,7 +1486,12 @@ impl Timeline { .layer_metadata .get(remote_layer_name) .map(LayerFileMetadata::from) - .unwrap_or(LayerFileMetadata::MISSING); + .with_context(|| { + format!( + "No remote layer metadata found for layer {}", + remote_layer_name.file_name() + ) + })?; // Is the local layer's size different from the size stored in the // remote index file? @@ -1530,34 +1507,27 @@ impl Timeline { local_layer_path.display() ); - if let Some(remote_size) = remote_layer_metadata.file_size() { - let metadata = local_layer_path.metadata().with_context(|| { - format!( - "get file size of local layer {}", - local_layer_path.display() - ) - })?; - let local_size = metadata.len(); - if local_size != remote_size { - warn!("removing local file {local_layer_path:?} because it has unexpected length {local_size}; length in remote index is {remote_size}"); - if let Err(err) = rename_to_backup(&local_layer_path) { - assert!(local_layer_path.exists(), "we would leave the local_layer without a file if this does not hold: {}", local_layer_path.display()); - anyhow::bail!("could not rename file {local_layer_path:?}: {err:?}"); - } else { - self.metrics.resident_physical_size_gauge.sub(local_size); - updates.remove_historic(local_layer); - // fall-through to adding the remote layer - } + let remote_size = remote_layer_metadata.file_size(); + let metadata = local_layer_path.metadata().with_context(|| { + format!( + "get file size of local layer {}", + local_layer_path.display() + ) + })?; + let local_size = metadata.len(); + if local_size != remote_size { + warn!("removing local file {local_layer_path:?} because it has unexpected length {local_size}; length in remote index is {remote_size}"); + if let Err(err) = rename_to_backup(&local_layer_path) { + assert!(local_layer_path.exists(), "we would leave the local_layer without a file if this does not hold: {}", local_layer_path.display()); + anyhow::bail!("could not rename file {local_layer_path:?}: {err:?}"); } else { - debug!( - "layer is present locally and file size matches remote, using it: {}", - local_layer_path.display() - ); - continue; + self.metrics.resident_physical_size_gauge.sub(local_size); + updates.remove_historic(local_layer); + // fall-through to adding the remote layer } } else { debug!( - "layer is present locally and remote does not have file size, using it: {}", + "layer is present locally and file size matches remote, using it: {}", local_layer_path.display() ); continue; @@ -1984,9 +1954,7 @@ impl Timeline { ) -> anyhow::Result<()> { if !layer.is_remote_layer() { layer.delete_resident_layer_file()?; - let layer_file_size = layer - .file_size() - .expect("Local layer should have a file size"); + let layer_file_size = layer.file_size(); self.metrics .resident_physical_size_gauge .sub(layer_file_size); diff --git a/pageserver/src/tenant/upload_queue.rs b/pageserver/src/tenant/upload_queue.rs index 790b2f59aa..08bc1f219d 100644 --- a/pageserver/src/tenant/upload_queue.rs +++ b/pageserver/src/tenant/upload_queue.rs @@ -127,12 +127,21 @@ impl UploadQueue { let mut files = HashMap::with_capacity(index_part.timeline_layers.len()); for layer_name in &index_part.timeline_layers { - let layer_metadata = index_part + match index_part .layer_metadata .get(layer_name) .map(LayerFileMetadata::from) - .unwrap_or(LayerFileMetadata::MISSING); - files.insert(layer_name.to_owned(), layer_metadata); + { + Some(layer_metadata) => { + files.insert(layer_name.to_owned(), layer_metadata); + } + None => { + anyhow::bail!( + "No remote layer metadata found for layer {}", + layer_name.file_name() + ); + } + } } let index_part_metadata = index_part.parse_metadata()?; diff --git a/test_runner/regress/test_tenants_with_remote_storage.py b/test_runner/regress/test_tenants_with_remote_storage.py index 769bc10280..c786f8a8e1 100644 --- a/test_runner/regress/test_tenants_with_remote_storage.py +++ b/test_runner/regress/test_tenants_with_remote_storage.py @@ -9,7 +9,6 @@ import asyncio import json import os -import shutil from pathlib import Path from typing import List, Tuple @@ -217,208 +216,9 @@ def test_tenants_attached_after_download( assert env.pageserver.log_contains(".*download .* succeeded after 1 retries.*") -@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS]) -def test_tenant_upgrades_index_json_from_v0( - neon_env_builder: NeonEnvBuilder, remote_storage_kind: RemoteStorageKind -): - # the "image" for the v0 index_part.json. the fields themselves are - # replaced with values read from the later version because of #2592 (initdb - # lsn not reproducible). - v0_skeleton = json.loads( - """{ - "timeline_layers":[ - "000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9" - ], - "missing_layers":["This should not fail as its not used anymore"], - "disk_consistent_lsn":"0/16960E8", - "metadata_bytes":[] - }""" - ) - - # getting a too eager compaction happening for this test would not play - # well with the strict assertions. - neon_env_builder.pageserver_config_override = "tenant_config.compaction_period='1h'" - - neon_env_builder.enable_remote_storage( - remote_storage_kind, "test_tenant_upgrades_index_json_from_v0" - ) - - # launch pageserver, populate the default tenants timeline, wait for it to be uploaded, - # then go ahead and modify the "remote" version as if it was downgraded, needing upgrade - env = neon_env_builder.init_start() - - pageserver_http = env.pageserver.http_client() - pg = env.postgres.create_start("main") - - tenant_id = TenantId(pg.safe_psql("show neon.tenant_id")[0][0]) - timeline_id = TimelineId(pg.safe_psql("show neon.timeline_id")[0][0]) - - with pg.cursor() as cur: - cur.execute("CREATE TABLE t0 AS VALUES (123, 'second column as text');") - current_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_flush_lsn()")) - - # flush, wait until in remote storage - wait_for_last_record_lsn(pageserver_http, tenant_id, timeline_id, current_lsn) - pageserver_http.timeline_checkpoint(tenant_id, timeline_id) - wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn) - env.postgres.stop_all() - env.pageserver.stop() - - # remove all local data for the tenant to force redownloading and subsequent upgrade - shutil.rmtree(Path(env.repo_dir) / "tenants" / str(tenant_id)) - - # downgrade the remote file - timeline_path = local_fs_index_part_path(env, tenant_id, timeline_id) - with open(timeline_path, "r+") as timeline_file: - # keep the deserialized for later inspection - orig_index_part = json.load(timeline_file) - - v0_index_part = { - key: orig_index_part[key] - for key in v0_skeleton.keys() - ["missing_layers"] # pgserver doesn't have it anymore - } - - timeline_file.seek(0) - json.dump(v0_index_part, timeline_file) - timeline_file.truncate(timeline_file.tell()) - - env.pageserver.start() - pageserver_http = env.pageserver.http_client() - pageserver_http.tenant_attach(tenant_id) - - wait_until( - number_of_iterations=5, - interval=1, - func=lambda: assert_tenant_status(pageserver_http, tenant_id, "Active"), - ) - - pg = env.postgres.create_start("main") - - with pg.cursor() as cur: - cur.execute("INSERT INTO t0 VALUES (234, 'test data');") - current_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_flush_lsn()")) - - wait_for_last_record_lsn(pageserver_http, tenant_id, timeline_id, current_lsn) - pageserver_http.timeline_checkpoint(tenant_id, timeline_id) - wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn) - - # not needed anymore - env.postgres.stop_all() - env.pageserver.stop() - - # make sure the file has been upgraded back to how it started - index_part = local_fs_index_part(env, tenant_id, timeline_id) - assert index_part["version"] == orig_index_part["version"] - assert "missing_layers" not in index_part.keys() - - # expect one more layer because of the forced checkpoint - assert len(index_part["timeline_layers"]) == len(orig_index_part["timeline_layers"]) + 1 - - # all of the same layer files are there, but they might be shuffled around - orig_layers = set(orig_index_part["timeline_layers"]) - later_layers = set(index_part["timeline_layers"]) - assert later_layers.issuperset(orig_layers) - - added_layers = later_layers - orig_layers - assert len(added_layers) == 1 - - # all of metadata has been regenerated (currently just layer file size) - all_metadata_keys = set() - for layer in orig_layers: - orig_metadata = orig_index_part["layer_metadata"][layer] - new_metadata = index_part["layer_metadata"][layer] - assert ( - orig_metadata == new_metadata - ), f"metadata for layer {layer} should not have changed {orig_metadata} vs. {new_metadata}" - all_metadata_keys |= set(orig_metadata.keys()) - - one_new_layer = next(iter(added_layers)) - assert one_new_layer in index_part["layer_metadata"], "new layer should have metadata" - - only_new_metadata = index_part["layer_metadata"][one_new_layer] - - assert ( - set(only_new_metadata.keys()).symmetric_difference(all_metadata_keys) == set() - ), "new layer metadata has same metadata as others" - - # FIXME: test index_part.json getting downgraded from imaginary new version -@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS]) -def test_tenant_ignores_backup_file( - neon_env_builder: NeonEnvBuilder, remote_storage_kind: RemoteStorageKind -): - # getting a too eager compaction happening for this test would not play - # well with the strict assertions. - neon_env_builder.pageserver_config_override = "tenant_config.compaction_period='1h'" - - neon_env_builder.enable_remote_storage(remote_storage_kind, "test_tenant_ignores_backup_file") - - # launch pageserver, populate the default tenants timeline, wait for it to be uploaded, - # then go ahead and modify the "remote" version as if it was downgraded, needing upgrade - env = neon_env_builder.init_start() - - env.pageserver.allowed_errors.append(".*got backup file on the remote storage, ignoring it.*") - - pageserver_http = env.pageserver.http_client() - pg = env.postgres.create_start("main") - - tenant_id = TenantId(pg.safe_psql("show neon.tenant_id")[0][0]) - timeline_id = TimelineId(pg.safe_psql("show neon.timeline_id")[0][0]) - - with pg.cursor() as cur: - cur.execute("CREATE TABLE t0 AS VALUES (123, 'second column as text');") - current_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_flush_lsn()")) - - # flush, wait until in remote storage - wait_for_last_record_lsn(pageserver_http, tenant_id, timeline_id, current_lsn) - pageserver_http.timeline_checkpoint(tenant_id, timeline_id) - wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn) - - env.postgres.stop_all() - env.pageserver.stop() - - # change the remote file to have entry with .0.old suffix - timeline_path = local_fs_index_part_path(env, tenant_id, timeline_id) - with open(timeline_path, "r+") as timeline_file: - # keep the deserialized for later inspection - orig_index_part = json.load(timeline_file) - backup_layer_name = orig_index_part["timeline_layers"][0] + ".0.old" - orig_index_part["timeline_layers"].append(backup_layer_name) - - timeline_file.seek(0) - json.dump(orig_index_part, timeline_file) - - env.pageserver.start() - pageserver_http = env.pageserver.http_client() - - wait_until( - number_of_iterations=5, - interval=1, - func=lambda: assert_tenant_status(pageserver_http, tenant_id, "Active"), - ) - - pg = env.postgres.create_start("main") - - with pg.cursor() as cur: - cur.execute("INSERT INTO t0 VALUES (234, 'test data');") - current_lsn = Lsn(query_scalar(cur, "SELECT pg_current_wal_flush_lsn()")) - - wait_for_last_record_lsn(pageserver_http, tenant_id, timeline_id, current_lsn) - pageserver_http.timeline_checkpoint(tenant_id, timeline_id) - wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn) - - # not needed anymore - env.postgres.stop_all() - env.pageserver.stop() - - # the .old file is gone from newly serialized index_part - new_index_part = local_fs_index_part(env, tenant_id, timeline_id) - backup_layers = filter(lambda x: x.endswith(".old"), new_index_part["timeline_layers"]) - assert len(list(backup_layers)) == 0 - - @pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS]) def test_tenant_redownloads_truncated_file_on_startup( neon_env_builder: NeonEnvBuilder, remote_storage_kind: RemoteStorageKind From 0f7de847856510807b5794d5c3903e058b07d066 Mon Sep 17 00:00:00 2001 From: Shany Pozin Date: Wed, 22 Mar 2023 09:17:00 +0200 Subject: [PATCH 05/23] Allow calling detach on ignored tenant (#3834) ## Describe your changes Added a query param to detach API Allow to remove local state of a tenant even if its not in the memory (following ignore API) ## Issue ticket number and link #3828 ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --------- Co-authored-by: Kirill Bulatov --- pageserver/src/http/openapi_spec.yml | 7 ++ pageserver/src/http/routes.rs | 3 +- pageserver/src/tenant/mgr.rs | 32 +++++--- test_runner/fixtures/neon_fixtures.py | 14 +++- test_runner/regress/test_tenant_detach.py | 90 ++++++++++++++++++++++- 5 files changed, 130 insertions(+), 16 deletions(-) diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index 3d3a9892bf..2098f848d5 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -351,6 +351,13 @@ paths: schema: type: string format: hex + - name: detach_ignored + in: query + required: false + schema: + type: boolean + description: | + When true, allow to detach a tenant which state is ignored. post: description: | Remove tenant data (including all corresponding timelines) from pageserver's memory and file system. diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index d91e421a52..04b7928d31 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -384,10 +384,11 @@ async fn timeline_delete_handler(request: Request) -> Result) -> Result, ApiError> { let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; check_permission(&request, Some(tenant_id))?; + let detach_ignored: Option = parse_query_param(&request, "detach_ignored")?; let state = get_state(&request); let conf = state.conf; - mgr::detach_tenant(conf, tenant_id) + mgr::detach_tenant(conf, tenant_id, detach_ignored.unwrap_or(false)) .instrument(info_span!("tenant_detach", tenant = %tenant_id)) .await?; diff --git a/pageserver/src/tenant/mgr.rs b/pageserver/src/tenant/mgr.rs index a4212ea8a6..26a2bb972c 100644 --- a/pageserver/src/tenant/mgr.rs +++ b/pageserver/src/tenant/mgr.rs @@ -315,10 +315,6 @@ pub async fn get_tenant( .get(&tenant_id) .ok_or(TenantStateError::NotFound(tenant_id))?; if active_only && !tenant.is_active() { - tracing::warn!( - "Tenant {tenant_id} is not active. Current state: {:?}", - tenant.current_state() - ); Err(TenantStateError::NotActive(tenant_id)) } else { Ok(Arc::clone(tenant)) @@ -350,17 +346,35 @@ pub enum TenantStateError { pub async fn detach_tenant( conf: &'static PageServerConf, tenant_id: TenantId, + detach_ignored: bool, ) -> Result<(), TenantStateError> { - remove_tenant_from_memory(tenant_id, async { - let local_tenant_directory = conf.tenant_path(&tenant_id); + let local_files_cleanup_operation = |tenant_id_to_clean| async move { + let local_tenant_directory = conf.tenant_path(&tenant_id_to_clean); fs::remove_dir_all(&local_tenant_directory) .await .with_context(|| { - format!("Failed to remove local tenant directory {local_tenant_directory:?}") + format!("local tenant directory {local_tenant_directory:?} removal") })?; Ok(()) - }) - .await + }; + + let removal_result = + remove_tenant_from_memory(tenant_id, local_files_cleanup_operation(tenant_id)).await; + + // Ignored tenants are not present in memory and will bail the removal from memory operation. + // Before returning the error, check for ignored tenant removal case — we only need to clean its local files then. + if detach_ignored && matches!(removal_result, Err(TenantStateError::NotFound(_))) { + let tenant_ignore_mark = conf.tenant_ignore_mark_file_path(tenant_id); + if tenant_ignore_mark.exists() { + info!("Detaching an ignored tenant"); + local_files_cleanup_operation(tenant_id) + .await + .with_context(|| format!("Ignored tenant {tenant_id} local files cleanup"))?; + return Ok(()); + } + } + + removal_result } pub async fn load_tenant( diff --git a/test_runner/fixtures/neon_fixtures.py b/test_runner/fixtures/neon_fixtures.py index 6429b1e940..9929d3e66b 100644 --- a/test_runner/fixtures/neon_fixtures.py +++ b/test_runner/fixtures/neon_fixtures.py @@ -1119,7 +1119,9 @@ def neon_env_builder( class PageserverApiException(Exception): - pass + def __init__(self, message, status_code: int): + super().__init__(message) + self.status_code = status_code class PageserverHttpClient(requests.Session): @@ -1140,7 +1142,7 @@ class PageserverHttpClient(requests.Session): msg = res.json()["msg"] except: # noqa: E722 msg = "" - raise PageserverApiException(msg) from e + raise PageserverApiException(msg, res.status_code) from e def check_status(self): self.get(f"http://localhost:{self.port}/v1/status").raise_for_status() @@ -1190,8 +1192,12 @@ class PageserverHttpClient(requests.Session): res = self.post(f"http://localhost:{self.port}/v1/tenant/{tenant_id}/attach") self.verbose_error(res) - def tenant_detach(self, tenant_id: TenantId): - res = self.post(f"http://localhost:{self.port}/v1/tenant/{tenant_id}/detach") + def tenant_detach(self, tenant_id: TenantId, detach_ignored=False): + params = {} + if detach_ignored: + params["detach_ignored"] = "true" + + res = self.post(f"http://localhost:{self.port}/v1/tenant/{tenant_id}/detach", params=params) self.verbose_error(res) def tenant_load(self, tenant_id: TenantId): diff --git a/test_runner/regress/test_tenant_detach.py b/test_runner/regress/test_tenant_detach.py index e061ab92a4..5db79eef4a 100644 --- a/test_runner/regress/test_tenant_detach.py +++ b/test_runner/regress/test_tenant_detach.py @@ -264,9 +264,11 @@ def test_tenant_detach_smoke(neon_env_builder: NeonEnvBuilder): with pytest.raises( expected_exception=PageserverApiException, match=f"NotFound: tenant {tenant_id}", - ): + ) as excinfo: pageserver_http.tenant_detach(tenant_id) + assert excinfo.value.status_code == 404 + # the error will be printed to the log too env.pageserver.allowed_errors.append(".*NotFound: tenant *") @@ -325,7 +327,91 @@ def test_tenant_detach_smoke(neon_env_builder: NeonEnvBuilder): pageserver_http.timeline_gc(tenant_id, timeline_id, 0) -# +# Creates and ignores a tenant, then detaches it: first, with no parameters (should fail), +# then with parameters to force ignored tenant detach (should not fail). +def test_tenant_detach_ignored_tenant(neon_simple_env: NeonEnv): + env = neon_simple_env + client = env.pageserver.http_client() + + # create a new tenant + tenant_id, _ = env.neon_cli.create_tenant() + + # assert tenant exists on disk + assert (env.repo_dir / "tenants" / str(tenant_id)).exists() + + pg = env.postgres.create_start("main", tenant_id=tenant_id) + # we rely upon autocommit after each statement + pg.safe_psql_many( + queries=[ + "CREATE TABLE t(key int primary key, value text)", + "INSERT INTO t SELECT generate_series(1,100000), 'payload'", + ] + ) + + # ignore tenant + client.tenant_ignore(tenant_id) + env.pageserver.allowed_errors.append(".*NotFound: tenant .*") + # ensure tenant couldn't be detached without the special flag for ignored tenant + log.info("detaching ignored tenant WITHOUT required flag") + with pytest.raises( + expected_exception=PageserverApiException, match=f"NotFound: tenant {tenant_id}" + ): + client.tenant_detach(tenant_id) + + log.info("tenant detached failed as expected") + + # ensure tenant is detached with ignore state + log.info("detaching ignored tenant with required flag") + client.tenant_detach(tenant_id, True) + log.info("ignored tenant detached without error") + + # check that nothing is left on disk for deleted tenant + assert not (env.repo_dir / "tenants" / str(tenant_id)).exists() + + # assert the tenant does not exists in the Pageserver + tenants_after_detach = [tenant["id"] for tenant in client.tenant_list()] + assert ( + tenant_id not in tenants_after_detach + ), f"Ignored and then detached tenant {tenant_id} \ + should not be present in pageserver's memory" + + +# Creates a tenant, and detaches it with extra paremeter that forces ignored tenant detach. +# Tenant should be detached without issues. +def test_tenant_detach_regular_tenant(neon_simple_env: NeonEnv): + env = neon_simple_env + client = env.pageserver.http_client() + + # create a new tenant + tenant_id, _ = env.neon_cli.create_tenant() + + # assert tenant exists on disk + assert (env.repo_dir / "tenants" / str(tenant_id)).exists() + + pg = env.postgres.create_start("main", tenant_id=tenant_id) + # we rely upon autocommit after each statement + pg.safe_psql_many( + queries=[ + "CREATE TABLE t(key int primary key, value text)", + "INSERT INTO t SELECT generate_series(1,100000), 'payload'", + ] + ) + + log.info("detaching regular tenant with detach ignored flag") + client.tenant_detach(tenant_id, True) + log.info("regular tenant detached without error") + + # check that nothing is left on disk for deleted tenant + assert not (env.repo_dir / "tenants" / str(tenant_id)).exists() + + # assert the tenant does not exists in the Pageserver + tenants_after_detach = [tenant["id"] for tenant in client.tenant_list()] + assert ( + tenant_id not in tenants_after_detach + ), f"Ignored and then detached tenant {tenant_id} \ + should not be present in pageserver's memory" + + @pytest.mark.parametrize("remote_storage_kind", available_remote_storages()) def test_detach_while_attaching( neon_env_builder: NeonEnvBuilder, From 14a40c9ca67568beac3783f597e44990c2c9a9e4 Mon Sep 17 00:00:00 2001 From: mikecaat <35882227+mikecaat@users.noreply.github.com> Date: Wed, 22 Mar 2023 17:10:53 +0900 Subject: [PATCH 06/23] Fix minor things for the docker-compose file (#3862) * Add the REPOSITORY env to build args to avoid the following error when executing without the credentials for the repository. ``` ERROR: Service 'compute' failed to build: Head "https://369495373322.dkr.ecr.eu-central-1.amazonaws.com/v2/compute-node-v15/manifests/2221": no basic auth credentials ``` * update the tag version in the documentation to support storage broker --- docker-compose/docker-compose.yml | 1 + docs/docker.md | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/docker-compose/docker-compose.yml b/docker-compose/docker-compose.yml index b24cb80ce4..4926dad932 100644 --- a/docker-compose/docker-compose.yml +++ b/docker-compose/docker-compose.yml @@ -160,6 +160,7 @@ services: build: context: ./compute_wrapper/ args: + - REPOSITORY=${REPOSITORY:-neondatabase} - COMPUTE_IMAGE=compute-node-v${PG_VERSION:-14} - TAG=${TAG:-latest} - http_proxy=$http_proxy diff --git a/docs/docker.md b/docs/docker.md index d264a1a748..704044377f 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -37,9 +37,9 @@ You can specify version of neon cluster using following environment values. - PG_VERSION: postgres version for compute (default is 14) - TAG: the tag version of [docker image](https://registry.hub.docker.com/r/neondatabase/neon/tags) (default is latest), which is tagged in [CI test](/.github/workflows/build_and_test.yml) ``` -$ cd docker-compose/docker-compose.yml +$ cd docker-compose/ $ docker-compose down # remove the conainers if exists -$ PG_VERSION=15 TAG=2221 docker-compose up --build -d # You can specify the postgres and image version +$ PG_VERSION=15 TAG=2937 docker-compose up --build -d # You can specify the postgres and image version Creating network "dockercompose_default" with the default driver Creating docker-compose_storage_broker_1 ... done (...omit...) From 6033dfdf4a9cbc5f81db551d6a8b259445f390c5 Mon Sep 17 00:00:00 2001 From: Joonas Koivunen Date: Wed, 22 Mar 2023 16:26:27 +0200 Subject: [PATCH 07/23] Re-access layers before threshold eviction (#3867) To avoid re-downloading evicted files on restart, re-compute logical size and partitioning before each threshold based eviction run. Cc: #3802 Co-authored-by: Christian Schwarz --- .../src/tenant/timeline/eviction_task.rs | 81 +++++++++++++++++-- 1 file changed, 74 insertions(+), 7 deletions(-) diff --git a/pageserver/src/tenant/timeline/eviction_task.rs b/pageserver/src/tenant/timeline/eviction_task.rs index 2aad0ef0f3..666768ff87 100644 --- a/pageserver/src/tenant/timeline/eviction_task.rs +++ b/pageserver/src/tenant/timeline/eviction_task.rs @@ -1,5 +1,18 @@ -//! The per-timeline layer eviction task. - +//! The per-timeline layer eviction task, which evicts data which has not been accessed for more +//! than a given threshold. +//! +//! Data includes all kinds of caches, namely: +//! - (in-memory layers) +//! - on-demand downloaded layer files on disk +//! - (cached layer file pages) +//! - derived data from layer file contents, namely: +//! - initial logical size +//! - partitioning +//! - (other currently missing unknowns) +//! +//! Items with parentheses are not (yet) touched by this task. +//! +//! See write-up on restart on-demand download spike: use std::{ ops::ControlFlow, sync::Arc, @@ -12,6 +25,7 @@ use tokio_util::sync::CancellationToken; use tracing::{debug, error, info, instrument, warn}; use crate::{ + context::{DownloadBehavior, RequestContext}, task_mgr::{self, TaskKind, BACKGROUND_RUNTIME}, tenant::{ config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold}, @@ -54,9 +68,10 @@ impl Timeline { } } + let ctx = RequestContext::new(TaskKind::Eviction, DownloadBehavior::Warn); loop { let policy = self.get_eviction_policy(); - let cf = self.eviction_iteration(&policy, cancel.clone()).await; + let cf = self.eviction_iteration(&policy, &cancel, &ctx).await; match cf { ControlFlow::Break(()) => break, @@ -77,7 +92,8 @@ impl Timeline { async fn eviction_iteration( self: &Arc, policy: &EvictionPolicy, - cancel: CancellationToken, + cancel: &CancellationToken, + ctx: &RequestContext, ) -> ControlFlow<(), Instant> { debug!("eviction iteration: {policy:?}"); match policy { @@ -87,7 +103,7 @@ impl Timeline { } EvictionPolicy::LayerAccessThreshold(p) => { let start = Instant::now(); - match self.eviction_iteration_threshold(p, cancel).await { + match self.eviction_iteration_threshold(p, cancel, ctx).await { ControlFlow::Break(()) => return ControlFlow::Break(()), ControlFlow::Continue(()) => (), } @@ -101,7 +117,8 @@ impl Timeline { async fn eviction_iteration_threshold( self: &Arc, p: &EvictionPolicyLayerAccessThreshold, - cancel: CancellationToken, + cancel: &CancellationToken, + ctx: &RequestContext, ) -> ControlFlow<()> { let now = SystemTime::now(); @@ -114,6 +131,20 @@ impl Timeline { not_evictable: usize, skipped_for_shutdown: usize, } + + // what we want is to invalidate any caches which haven't been accessed for `p.threshold`, + // but we cannot actually do it for current limitations except by restarting pageserver. we + // just recompute the values which would be recomputed on startup. + // + // for active tenants this will likely materialized page cache or in-memory layers. for + // inactive tenants it will refresh the last_access timestamps so that we will not evict + // and re-download on restart these layers. + self.refresh_layers_required_in_restart(cancel, ctx).await; + + if cancel.is_cancelled() { + return ControlFlow::Break(()); + } + let mut stats = EvictionStats::default(); // Gather layers for eviction. // NB: all the checks can be invalidated as soon as we release the layer map lock. @@ -174,7 +205,7 @@ impl Timeline { }; let results = match self - .evict_layer_batch(remote_client, &candidates[..], cancel) + .evict_layer_batch(remote_client, &candidates[..], cancel.clone()) .await { Err(pre_err) => { @@ -216,4 +247,40 @@ impl Timeline { } ControlFlow::Continue(()) } + + /// Recompute the values which would cause on-demand downloads during restart. + async fn refresh_layers_required_in_restart( + &self, + cancel: &CancellationToken, + ctx: &RequestContext, + ) { + let lsn = self.get_last_record_lsn(); + + // imitiate on-restart initial logical size + let size = self.calculate_logical_size(lsn, cancel.clone(), ctx).await; + + match &size { + Ok(_size) => { + // good, don't log it to avoid confusion + } + Err(_) => { + // we have known issues for which we already log this on consumption metrics, + // gc, and compaction. leave logging out for now. + // + // https://github.com/neondatabase/neon/issues/2539 + } + } + + // imitiate repartiting on first compactation + if let Err(e) = self.collect_keyspace(lsn, ctx).await { + // if this failed, we probably failed logical size because these use the same keys + if size.is_err() { + // ignore, see above comment + } else { + warn!( + "failed to collect keyspace but succeeded in calculating logical size: {e:#}" + ); + } + } + } } From 8bd565e09ede7b3af0e0dddce68968ba02ac8b54 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 22 Mar 2023 17:42:31 +0200 Subject: [PATCH 08/23] Ensure branches with no layers have their remote storage counterpart created eventually (#3857) Discovered during writing a test for https://github.com/neondatabase/neon/pull/3843 --- pageserver/src/tenant.rs | 39 +++-- pageserver/src/tenant/timeline.rs | 51 +++++-- test_runner/regress/test_remote_storage.py | 164 ++++++++++++++++++--- 3 files changed, 205 insertions(+), 49 deletions(-) diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs index 5f1e23b873..b462c93b2d 100644 --- a/pageserver/src/tenant.rs +++ b/pageserver/src/tenant.rs @@ -478,7 +478,7 @@ impl Tenant { let dummy_timeline = self.create_timeline_data( timeline_id, - up_to_date_metadata.clone(), + up_to_date_metadata, ancestor.clone(), remote_client, )?; @@ -503,7 +503,7 @@ impl Tenant { let broken_timeline = self .create_timeline_data( timeline_id, - up_to_date_metadata.clone(), + up_to_date_metadata, ancestor.clone(), None, ) @@ -1142,7 +1142,7 @@ impl Tenant { ); self.prepare_timeline( new_timeline_id, - new_metadata, + &new_metadata, timeline_uninit_mark, true, None, @@ -1700,7 +1700,7 @@ impl Tenant { fn create_timeline_data( &self, new_timeline_id: TimelineId, - new_metadata: TimelineMetadata, + new_metadata: &TimelineMetadata, ancestor: Option>, remote_client: Option, ) -> anyhow::Result> { @@ -2160,13 +2160,25 @@ impl Tenant { let new_timeline = self .prepare_timeline( dst_id, - metadata, + &metadata, timeline_uninit_mark, false, Some(Arc::clone(src_timeline)), )? .initialize_with_lock(&mut timelines, true, true)?; drop(timelines); + + // Root timeline gets its layers during creation and uploads them along with the metadata. + // A branch timeline though, when created, can get no writes for some time, hence won't get any layers created. + // We still need to upload its metadata eagerly: if other nodes `attach` the tenant and miss this timeline, their GC + // could get incorrect information and remove more layers, than needed. + // See also https://github.com/neondatabase/neon/issues/3865 + if let Some(remote_client) = new_timeline.remote_client.as_ref() { + remote_client + .schedule_index_upload_for_metadata_update(&metadata) + .context("branch initial metadata upload")?; + } + info!("branched timeline {dst_id} from {src_id} at {start_lsn}"); Ok(new_timeline) @@ -2229,7 +2241,7 @@ impl Tenant { pg_version, ); let raw_timeline = - self.prepare_timeline(timeline_id, new_metadata, timeline_uninit_mark, true, None)?; + self.prepare_timeline(timeline_id, &new_metadata, timeline_uninit_mark, true, None)?; let tenant_id = raw_timeline.owning_tenant.tenant_id; let unfinished_timeline = raw_timeline.raw_timeline()?; @@ -2283,7 +2295,7 @@ impl Tenant { fn prepare_timeline( &self, new_timeline_id: TimelineId, - new_metadata: TimelineMetadata, + new_metadata: &TimelineMetadata, uninit_mark: TimelineUninitMark, init_layers: bool, ancestor: Option>, @@ -2297,7 +2309,7 @@ impl Tenant { tenant_id, new_timeline_id, ); - remote_client.init_upload_queue_for_empty_remote(&new_metadata)?; + remote_client.init_upload_queue_for_empty_remote(new_metadata)?; Some(remote_client) } else { None @@ -2336,17 +2348,12 @@ impl Tenant { &self, timeline_path: &Path, new_timeline_id: TimelineId, - new_metadata: TimelineMetadata, + new_metadata: &TimelineMetadata, ancestor: Option>, remote_client: Option, ) -> anyhow::Result> { let timeline_data = self - .create_timeline_data( - new_timeline_id, - new_metadata.clone(), - ancestor, - remote_client, - ) + .create_timeline_data(new_timeline_id, new_metadata, ancestor, remote_client) .context("Failed to create timeline data structure")?; crashsafe::create_dir_all(timeline_path).context("Failed to create timeline directory")?; @@ -2358,7 +2365,7 @@ impl Tenant { self.conf, new_timeline_id, self.tenant_id, - &new_metadata, + new_metadata, true, ) .context("Failed to create timeline metadata")?; diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs index 4d03a78883..33909e749b 100644 --- a/pageserver/src/tenant/timeline.rs +++ b/pageserver/src/tenant/timeline.rs @@ -1163,7 +1163,7 @@ impl Timeline { pub(super) fn new( conf: &'static PageServerConf, tenant_conf: Arc>, - metadata: TimelineMetadata, + metadata: &TimelineMetadata, ancestor: Option>, timeline_id: TimelineId, tenant_id: TenantId, @@ -1629,6 +1629,8 @@ impl Timeline { .map(|l| (l.filename(), l)) .collect::>(); + // If no writes happen, new branches do not have any layers, only the metadata file. + let has_local_layers = !local_layers.is_empty(); let local_only_layers = match index_part { Some(index_part) => { info!( @@ -1646,21 +1648,40 @@ impl Timeline { } }; - // Are there local files that don't exist remotely? Schedule uploads for them - for (layer_name, layer) in &local_only_layers { - // XXX solve this in the type system - let layer_path = layer - .local_path() - .expect("local_only_layers only contains local layers"); - let layer_size = layer_path - .metadata() - .with_context(|| format!("failed to get file {layer_path:?} metadata"))? - .len(); - info!("scheduling {layer_path:?} for upload"); - remote_client - .schedule_layer_file_upload(layer_name, &LayerFileMetadata::new(layer_size))?; + if has_local_layers { + // Are there local files that don't exist remotely? Schedule uploads for them. + // Local timeline metadata will get uploaded to remove along witht he layers. + for (layer_name, layer) in &local_only_layers { + // XXX solve this in the type system + let layer_path = layer + .local_path() + .expect("local_only_layers only contains local layers"); + let layer_size = layer_path + .metadata() + .with_context(|| format!("failed to get file {layer_path:?} metadata"))? + .len(); + info!("scheduling {layer_path:?} for upload"); + remote_client + .schedule_layer_file_upload(layer_name, &LayerFileMetadata::new(layer_size))?; + } + remote_client.schedule_index_upload_for_file_changes()?; + } else if index_part.is_none() { + // No data on the remote storage, no local layers, local metadata file. + // + // TODO https://github.com/neondatabase/neon/issues/3865 + // Currently, console does not wait for the timeline data upload to the remote storage + // and considers the timeline created, expecting other pageserver nodes to work with it. + // Branch metadata upload could get interrupted (e.g pageserver got killed), + // hence any locally existing branch metadata with no remote counterpart should be uploaded, + // otherwise any other pageserver won't see the branch on `attach`. + // + // After the issue gets implemented, pageserver should rather remove the branch, + // since absence on S3 means we did not acknowledge the branch creation and console will have to retry, + // no need to keep the old files. + remote_client.schedule_index_upload_for_metadata_update(up_to_date_metadata)?; + } else { + // Local timeline has a metadata file, remote one too, both have no layers to sync. } - remote_client.schedule_index_upload_for_file_changes()?; info!("Done"); diff --git a/test_runner/regress/test_remote_storage.py b/test_runner/regress/test_remote_storage.py index 1f6f0c67cc..f6600e8974 100644 --- a/test_runner/regress/test_remote_storage.py +++ b/test_runner/regress/test_remote_storage.py @@ -11,8 +11,10 @@ from typing import Dict, List, Tuple import pytest from fixtures.log_helper import log from fixtures.neon_fixtures import ( + LocalFsStorage, NeonEnvBuilder, PageserverApiException, + PageserverHttpClient, RemoteStorageKind, available_remote_storages, wait_for_last_flush_lsn, @@ -421,23 +423,6 @@ def test_remote_timeline_client_calls_started_metric( ) wait_for_last_flush_lsn(env, pg, tenant_id, timeline_id) - def get_queued_count(file_kind, op_kind): - val = client.get_remote_timeline_client_metric( - "pageserver_remote_timeline_client_calls_unfinished", - tenant_id, - timeline_id, - file_kind, - op_kind, - ) - if val is None: - return val - return int(val) - - def wait_upload_queue_empty(): - wait_until(2, 1, lambda: get_queued_count(file_kind="layer", op_kind="upload") == 0) - wait_until(2, 1, lambda: get_queued_count(file_kind="index", op_kind="upload") == 0) - wait_until(2, 1, lambda: get_queued_count(file_kind="layer", op_kind="delete") == 0) - calls_started: Dict[Tuple[str, str], List[int]] = { ("layer", "upload"): [0], ("index", "upload"): [0], @@ -478,7 +463,7 @@ def test_remote_timeline_client_calls_started_metric( # create some layers & wait for uploads to finish churn("a", "b") - wait_upload_queue_empty() + wait_upload_queue_empty(client, tenant_id, timeline_id) # ensure that we updated the calls_started metric fetch_calls_started() @@ -637,4 +622,147 @@ def test_timeline_deletion_with_files_stuck_in_upload_queue( time.sleep(10) +# Branches off a root branch, but does not write anything to the new branch, so it has a metadata file only. +# Ensures that such branch is still persisted on the remote storage, and can be restored during tenant (re)attach. +@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS]) +def test_empty_branch_remote_storage_upload( + neon_env_builder: NeonEnvBuilder, + remote_storage_kind: RemoteStorageKind, +): + neon_env_builder.enable_remote_storage( + remote_storage_kind=remote_storage_kind, + test_name="test_empty_branch_remote_storage_upload", + ) + + env = neon_env_builder.init_start() + client = env.pageserver.http_client() + + new_branch_name = "new_branch" + new_branch_timeline_id = env.neon_cli.create_branch(new_branch_name, "main", env.initial_tenant) + + with env.postgres.create_start(new_branch_name, tenant_id=env.initial_tenant) as pg: + wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_branch_timeline_id) + wait_upload_queue_empty(client, env.initial_tenant, new_branch_timeline_id) + + timelines_before_detach = set( + map( + lambda t: TimelineId(t["timeline_id"]), + client.timeline_list(env.initial_tenant), + ) + ) + expected_timelines = set([env.initial_timeline, new_branch_timeline_id]) + assert ( + timelines_before_detach == expected_timelines + ), f"Expected to have an initial timeline and the branch timeline only, but got {timelines_before_detach}" + + client.tenant_detach(env.initial_tenant) + client.tenant_attach(env.initial_tenant) + wait_until_tenant_state(client, env.initial_tenant, "Active", 5) + + timelines_after_detach = set( + map( + lambda t: TimelineId(t["timeline_id"]), + client.timeline_list(env.initial_tenant), + ) + ) + + assert ( + timelines_before_detach == timelines_after_detach + ), f"Expected to have same timelines after reattach, but got {timelines_after_detach}" + + +# Branches off a root branch, but does not write anything to the new branch, so it has a metadata file only. +# Ensures the branch is not on the remote storage and restarts the pageserver — the branch should be uploaded after the restart. +@pytest.mark.parametrize("remote_storage_kind", [RemoteStorageKind.LOCAL_FS]) +def test_empty_branch_remote_storage_upload_on_restart( + neon_env_builder: NeonEnvBuilder, + remote_storage_kind: RemoteStorageKind, +): + neon_env_builder.enable_remote_storage( + remote_storage_kind=remote_storage_kind, + test_name="test_empty_branch_remote_storage_upload_on_restart", + ) + + env = neon_env_builder.init_start() + client = env.pageserver.http_client() + + new_branch_name = "new_branch" + new_branch_timeline_id = env.neon_cli.create_branch(new_branch_name, "main", env.initial_tenant) + + with env.postgres.create_start(new_branch_name, tenant_id=env.initial_tenant) as pg: + wait_for_last_flush_lsn(env, pg, env.initial_tenant, new_branch_timeline_id) + wait_upload_queue_empty(client, env.initial_tenant, new_branch_timeline_id) + + env.pageserver.stop() + + # Remove new branch from the remote storage + assert isinstance(env.remote_storage, LocalFsStorage) + new_branch_on_remote_storage = ( + env.remote_storage.root + / "tenants" + / str(env.initial_tenant) + / "timelines" + / str(new_branch_timeline_id) + ) + assert ( + new_branch_on_remote_storage.is_dir() + ), f"'{new_branch_on_remote_storage}' path does not exist on the remote storage" + shutil.rmtree(new_branch_on_remote_storage) + + env.pageserver.start() + + wait_upload_queue_empty(client, env.initial_tenant, new_branch_timeline_id) + assert ( + new_branch_on_remote_storage.is_dir() + ), f"New branch should have been reuploaded on pageserver restart to the remote storage path '{new_branch_on_remote_storage}'" + + +def wait_upload_queue_empty( + client: PageserverHttpClient, tenant_id: TenantId, timeline_id: TimelineId +): + wait_until( + 2, + 1, + lambda: get_queued_count( + client, tenant_id, timeline_id, file_kind="layer", op_kind="upload" + ) + == 0, + ) + wait_until( + 2, + 1, + lambda: get_queued_count( + client, tenant_id, timeline_id, file_kind="index", op_kind="upload" + ) + == 0, + ) + wait_until( + 2, + 1, + lambda: get_queued_count( + client, tenant_id, timeline_id, file_kind="layer", op_kind="delete" + ) + == 0, + ) + + +def get_queued_count( + client: PageserverHttpClient, + tenant_id: TenantId, + timeline_id: TimelineId, + file_kind: str, + op_kind: str, +): + val = client.get_remote_timeline_client_metric( + "pageserver_remote_timeline_client_calls_unfinished", + tenant_id, + timeline_id, + file_kind, + op_kind, + ) + if val is None: + return val + return int(val) + + # TODO Test that we correctly handle GC of files that are stuck in upload queue. From f5ca897292d78faf6a447a8f4542b1b1400e6dd1 Mon Sep 17 00:00:00 2001 From: Joonas Koivunen Date: Thu, 23 Mar 2023 12:00:52 +0200 Subject: [PATCH 09/23] fix: less logging at shutdown (#3866) Log less during shutdown; don't log anything for quickly (less than 1s) exiting tasks. --- pageserver/src/task_mgr.rs | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/pageserver/src/task_mgr.rs b/pageserver/src/task_mgr.rs index 2734031a09..44b1bbb06d 100644 --- a/pageserver/src/task_mgr.rs +++ b/pageserver/src/task_mgr.rs @@ -481,13 +481,25 @@ pub async fn shutdown_tasks( for task in victim_tasks { let join_handle = { let mut task_mut = task.mutable.lock().unwrap(); - info!("waiting for {} to shut down", task.name); - let join_handle = task_mut.join_handle.take(); - drop(task_mut); - join_handle + task_mut.join_handle.take() }; - if let Some(join_handle) = join_handle { - let _ = join_handle.await; + if let Some(mut join_handle) = join_handle { + let completed = tokio::select! { + _ = &mut join_handle => { true }, + _ = tokio::time::sleep(std::time::Duration::from_secs(1)) => { + // allow some time to elapse before logging to cut down the number of log + // lines. + info!("waiting for {} to shut down", task.name); + false + } + }; + if !completed { + // we never handled this return value, but: + // - we don't deschedule which would lead to is_cancelled + // - panics are already logged (is_panicked) + // - task errors are already logged in the wrapper + let _ = join_handle.await; + } } else { // Possibly one of: // * The task had not even fully started yet. From 870ba43a1ff6840b56d64709a59e3616e4040870 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Fri, 24 Mar 2023 20:25:39 +0300 Subject: [PATCH 10/23] return proper http codes in timeline delete endpoint (#3876) return proper http codes in timeline delete endpoint + fix openapi spec for detach to include 404 responses --- pageserver/src/http/openapi_spec.yml | 12 +++++++++++ pageserver/src/http/routes.rs | 23 +++++++++++++++++++++ pageserver/src/tenant.rs | 22 ++++++++++++++------ pageserver/src/tenant/mgr.rs | 11 +++++++++- test_runner/regress/test_timeline_delete.py | 12 ++++++++--- 5 files changed, 70 insertions(+), 10 deletions(-) diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index 2098f848d5..b8c3bffcd5 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -183,6 +183,12 @@ paths: application/json: schema: $ref: "#/components/schemas/ForbiddenError" + "404": + description: Timeline not found + content: + application/json: + schema: + $ref: "#/components/schemas/NotFoundError" "500": description: Generic operation error content: @@ -383,6 +389,12 @@ paths: application/json: schema: $ref: "#/components/schemas/ForbiddenError" + "404": + description: Tenant not found + content: + application/json: + schema: + $ref: "#/components/schemas/NotFoundError" "500": description: Generic operation error content: diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 04b7928d31..ba53729ea9 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -131,6 +131,29 @@ impl From for ApiError { } } +impl From for ApiError { + fn from(value: crate::tenant::DeleteTimelineError) -> Self { + use crate::tenant::DeleteTimelineError::*; + match value { + NotFound => ApiError::NotFound(anyhow::anyhow!("timeline not found")), + HasChildren => ApiError::BadRequest(anyhow::anyhow!( + "Cannot delete timeline which has child timelines" + )), + Other(e) => ApiError::InternalServerError(e), + } + } +} + +impl From for ApiError { + fn from(value: crate::tenant::mgr::DeleteTimelineError) -> Self { + use crate::tenant::mgr::DeleteTimelineError::*; + match value { + Tenant(t) => ApiError::from(t), + Timeline(t) => ApiError::from(t), + } + } +} + // Helper function to construct a TimelineInfo struct for a timeline async fn build_timeline_info( timeline: &Arc, diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs index b462c93b2d..0a167fd787 100644 --- a/pageserver/src/tenant.rs +++ b/pageserver/src/tenant.rs @@ -431,6 +431,16 @@ remote: } } +#[derive(Debug, thiserror::Error)] +pub enum DeleteTimelineError { + #[error("NotFound")] + NotFound, + #[error("HasChildren")] + HasChildren, + #[error(transparent)] + Other(#[from] anyhow::Error), +} + struct RemoteStartupData { index_part: IndexPart, remote_metadata: TimelineMetadata, @@ -1307,7 +1317,7 @@ impl Tenant { &self, timeline_id: TimelineId, _ctx: &RequestContext, - ) -> anyhow::Result<()> { + ) -> Result<(), DeleteTimelineError> { // Transition the timeline into TimelineState::Stopping. // This should prevent new operations from starting. let timeline = { @@ -1319,13 +1329,13 @@ impl Tenant { .iter() .any(|(_, entry)| entry.get_ancestor_timeline_id() == Some(timeline_id)); - anyhow::ensure!( - !children_exist, - "Cannot delete timeline which has child timelines" - ); + if children_exist { + return Err(DeleteTimelineError::HasChildren); + } + let timeline_entry = match timelines.entry(timeline_id) { Entry::Occupied(e) => e, - Entry::Vacant(_) => bail!("timeline not found"), + Entry::Vacant(_) => return Err(DeleteTimelineError::NotFound), }; let timeline = Arc::clone(timeline_entry.get()); diff --git a/pageserver/src/tenant/mgr.rs b/pageserver/src/tenant/mgr.rs index 26a2bb972c..4971186206 100644 --- a/pageserver/src/tenant/mgr.rs +++ b/pageserver/src/tenant/mgr.rs @@ -321,11 +321,20 @@ pub async fn get_tenant( } } +#[derive(Debug, thiserror::Error)] +pub enum DeleteTimelineError { + #[error("Tenant {0}")] + Tenant(#[from] TenantStateError), + + #[error("Timeline {0}")] + Timeline(#[from] crate::tenant::DeleteTimelineError), +} + pub async fn delete_timeline( tenant_id: TenantId, timeline_id: TimelineId, ctx: &RequestContext, -) -> Result<(), TenantStateError> { +) -> Result<(), DeleteTimelineError> { let tenant = get_tenant(tenant_id, true).await?; tenant.delete_timeline(timeline_id, ctx).await?; Ok(()) diff --git a/test_runner/regress/test_timeline_delete.py b/test_runner/regress/test_timeline_delete.py index b9c4f5b83f..30d894e04c 100644 --- a/test_runner/regress/test_timeline_delete.py +++ b/test_runner/regress/test_timeline_delete.py @@ -25,9 +25,11 @@ def test_timeline_delete(neon_simple_env: NeonEnv): with pytest.raises( PageserverApiException, match=f"NotFound: tenant {invalid_tenant_id}", - ): + ) as exc: ps_http.timeline_delete(tenant_id=invalid_tenant_id, timeline_id=invalid_timeline_id) + assert exc.value.status_code == 404 + # construct pair of branches to validate that pageserver prohibits # deletion of ancestor timelines when they have child branches parent_timeline_id = env.neon_cli.create_branch("test_ancestor_branch_delete_parent", "empty") @@ -39,7 +41,7 @@ def test_timeline_delete(neon_simple_env: NeonEnv): ps_http = env.pageserver.http_client() with pytest.raises( PageserverApiException, match="Cannot delete timeline which has child timelines" - ): + ) as exc: timeline_path = ( env.repo_dir / "tenants" @@ -53,6 +55,8 @@ def test_timeline_delete(neon_simple_env: NeonEnv): assert not timeline_path.exists() + assert exc.value.status_code == 400 + timeline_path = ( env.repo_dir / "tenants" / str(env.initial_tenant) / "timelines" / str(leaf_timeline_id) ) @@ -71,7 +75,7 @@ def test_timeline_delete(neon_simple_env: NeonEnv): with pytest.raises( PageserverApiException, match=f"Timeline {env.initial_tenant}/{leaf_timeline_id} was not found", - ): + ) as exc: ps_http.timeline_detail(env.initial_tenant, leaf_timeline_id) # FIXME leaves tenant without timelines, should we prevent deletion of root timeline? @@ -80,3 +84,5 @@ def test_timeline_delete(neon_simple_env: NeonEnv): interval=0.2, func=lambda: ps_http.timeline_delete(env.initial_tenant, parent_timeline_id), ) + + assert exc.value.status_code == 404 From 4071ff8c7b699565d79f772e44ca2423e00a6a3b Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Sat, 25 Mar 2023 12:33:39 +0000 Subject: [PATCH 11/23] Bump openssl from 0.10.45 to 0.10.48 in /test_runner/pg_clients/rust/tokio-postgres (#3879) --- test_runner/pg_clients/rust/tokio-postgres/Cargo.lock | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/test_runner/pg_clients/rust/tokio-postgres/Cargo.lock b/test_runner/pg_clients/rust/tokio-postgres/Cargo.lock index 96989ee5ee..a0067b183e 100644 --- a/test_runner/pg_clients/rust/tokio-postgres/Cargo.lock +++ b/test_runner/pg_clients/rust/tokio-postgres/Cargo.lock @@ -389,9 +389,9 @@ checksum = "b7e5500299e16ebb147ae15a00a942af264cf3688f47923b8fc2cd5858f23ad3" [[package]] name = "openssl" -version = "0.10.45" +version = "0.10.48" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b102428fd03bc5edf97f62620f7298614c45cedf287c271e7ed450bbaf83f2e1" +checksum = "518915b97df115dd36109bfa429a48b8f737bd05508cf9588977b599648926d2" dependencies = [ "bitflags", "cfg-if", @@ -421,9 +421,9 @@ checksum = "ff011a302c396a5197692431fc1948019154afc178baf7d8e37367442a4601cf" [[package]] name = "openssl-sys" -version = "0.9.80" +version = "0.9.83" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "23bbbf7854cd45b83958ebe919f0e8e516793727652e27fda10a8384cfc790b7" +checksum = "666416d899cf077260dac8698d60a60b435a46d57e82acb1be3d0dad87284e5b" dependencies = [ "autocfg", "cc", From 4d8c7654852cdae2375243aabd02e2f8a183e026 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 27 Mar 2023 12:04:48 +0300 Subject: [PATCH 12/23] remove redundant dyn (#3878) remove redundant dyn --- libs/utils/src/id.rs | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libs/utils/src/id.rs b/libs/utils/src/id.rs index f84bcb793f..b27c5cda35 100644 --- a/libs/utils/src/id.rs +++ b/libs/utils/src/id.rs @@ -23,7 +23,7 @@ pub enum IdError { struct Id([u8; 16]); impl Id { - pub fn get_from_buf(buf: &mut dyn bytes::Buf) -> Id { + pub fn get_from_buf(buf: &mut impl bytes::Buf) -> Id { let mut arr = [0u8; 16]; buf.copy_to_slice(&mut arr); Id::from(arr) @@ -112,7 +112,7 @@ impl fmt::Debug for Id { macro_rules! id_newtype { ($t:ident) => { impl $t { - pub fn get_from_buf(buf: &mut dyn bytes::Buf) -> $t { + pub fn get_from_buf(buf: &mut impl bytes::Buf) -> $t { $t(Id::get_from_buf(buf)) } From 8d783299919ac4c8d86fb5c5187450cc49c0108a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Sat, 25 Mar 2023 13:43:04 +0200 Subject: [PATCH 13/23] Remove some dead code. whoami() was never called, 'is_test' was never set. 'restart()' might be useful, but it wasn't hooked up the CLI so it was dead code. It's not clear what kind of a restart it should perform, anyway: just restart Postgres, or re-initialize the data directory from a fresh basebackup like "stop"+"start" does. --- control_plane/src/compute.rs | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index 730cacf40b..46f0ad8d4f 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -87,7 +87,6 @@ impl ComputeControlPlane { address: SocketAddr::new("127.0.0.1".parse().unwrap(), port), env: self.env.clone(), pageserver: Arc::clone(&self.pageserver), - is_test: false, timeline_id, lsn, tenant_id, @@ -113,7 +112,6 @@ pub struct PostgresNode { name: String, pub env: LocalEnv, pageserver: Arc, - is_test: bool, pub timeline_id: TimelineId, pub lsn: Option, // if it's a read-only node. None for primary pub tenant_id: TenantId, @@ -171,7 +169,6 @@ impl PostgresNode { name, env: env.clone(), pageserver: Arc::clone(pageserver), - is_test: false, timeline_id, lsn: recovery_target_lsn, tenant_id, @@ -480,10 +477,6 @@ impl PostgresNode { self.pg_ctl(&["start"], auth_token) } - pub fn restart(&self, auth_token: &Option) -> Result<()> { - self.pg_ctl(&["restart"], auth_token) - } - pub fn stop(&self, destroy: bool) -> Result<()> { // If we are going to destroy data directory, // use immediate shutdown mode, otherwise, @@ -514,26 +507,4 @@ impl PostgresNode { "postgres" ) } - - // XXX: cache that in control plane - pub fn whoami(&self) -> String { - let output = Command::new("whoami") - .output() - .expect("failed to execute whoami"); - - assert!(output.status.success(), "whoami failed"); - - String::from_utf8(output.stdout).unwrap().trim().to_string() - } -} - -impl Drop for PostgresNode { - // destructor to clean up state after test is done - // XXX: we may detect failed test by setting some flag in catch_unwind() - // and checking it here. But let just clean datadirs on start. - fn drop(&mut self) { - if self.is_test { - let _ = self.stop(true); - } - } } From e3cbcc2ea759a58f90c07e6ece984310bbb43492 Mon Sep 17 00:00:00 2001 From: Vadim Kharitonov Date: Mon, 27 Mar 2023 10:13:34 +0200 Subject: [PATCH 14/23] Revert "Add `neondatabase/release` team as a default reviewers for storage" This reverts commit daeaa767c405532f0c8bdb8a5765f0c13fd83aee. --- .github/workflows/release.yml | 1 - 1 file changed, 1 deletion(-) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 014084c410..4bce9cdd1e 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -31,4 +31,3 @@ jobs: head: releases/${{ steps.date.outputs.date }} base: release title: Release ${{ steps.date.outputs.date }} - team_reviewers: release From ff51e96fbd864504494ab301edfe955a2f030d47 Mon Sep 17 00:00:00 2001 From: Christian Schwarz Date: Mon, 27 Mar 2023 12:45:10 +0200 Subject: [PATCH 15/23] fix synthetic size for (last_record_lsn - gc_horizon) < initdb_lsn (#3874) fix synthetic size for (last_record_lsn - gc_horizon) < initdb_lsn Assume a single-timeline project. If the gc_horizon covers all WAL (last_record_lsn < gc_horizon) but we have written more data than just initdb, the synthetic size calculation worker needs to calculate the logical size at LSN initdb_lsn (Segment BranchStart). Before this patch, that calculation would incorrectly return the initial logical size calculation result that we cache in the Timeline::initial_logical_size. Presumably, because there was confusion around initdb_lsn vs. initial size calculation. The fix is to only hand out the initialized_size() only if the LSN matches. The distinction in the metrics between "init logical size" and "logical size" was also incorrect because of the above. So, remove it. There was a special case for `size != 0`. This was to cover the case of LogicalSize::empty_initial(), but `initial_part_end` is `None` in that case, so the new `LogicalSize::initialized_size()` will return None in that case as well. Lastly, to prevent confusion like this in the future, rename all occurrences of `init_lsn` to either just `lsn` or a more specific name. Co-authored-by: Joonas Koivunen Co-authored-by: Heikki Linnakangas --- pageserver/src/metrics.rs | 4 --- pageserver/src/tenant/timeline.rs | 43 ++++++++++++++----------------- 2 files changed, 19 insertions(+), 28 deletions(-) diff --git a/pageserver/src/metrics.rs b/pageserver/src/metrics.rs index b5563ad186..6cb245aed7 100644 --- a/pageserver/src/metrics.rs +++ b/pageserver/src/metrics.rs @@ -586,7 +586,6 @@ pub struct TimelineMetrics { pub flush_time_histo: StorageTimeMetrics, pub compact_time_histo: StorageTimeMetrics, pub create_images_time_histo: StorageTimeMetrics, - pub init_logical_size_histo: StorageTimeMetrics, pub logical_size_histo: StorageTimeMetrics, pub load_layer_map_histo: StorageTimeMetrics, pub garbage_collect_histo: StorageTimeMetrics, @@ -619,8 +618,6 @@ impl TimelineMetrics { let compact_time_histo = StorageTimeMetrics::new("compact", &tenant_id, &timeline_id); let create_images_time_histo = StorageTimeMetrics::new("create images", &tenant_id, &timeline_id); - let init_logical_size_histo = - StorageTimeMetrics::new("init logical size", &tenant_id, &timeline_id); let logical_size_histo = StorageTimeMetrics::new("logical size", &tenant_id, &timeline_id); let load_layer_map_histo = StorageTimeMetrics::new("load layer map", &tenant_id, &timeline_id); @@ -657,7 +654,6 @@ impl TimelineMetrics { flush_time_histo, compact_time_histo, create_images_time_histo, - init_logical_size_histo, logical_size_histo, garbage_collect_histo, load_layer_map_histo, diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs index 33909e749b..5fde1a77e0 100644 --- a/pageserver/src/tenant/timeline.rs +++ b/pageserver/src/tenant/timeline.rs @@ -328,9 +328,13 @@ impl LogicalSize { .fetch_add(delta, AtomicOrdering::SeqCst); } - /// Returns the initialized (already calculated) value, if any. - fn initialized_size(&self) -> Option { - self.initial_logical_size.get().copied() + /// Make the value computed by initial logical size computation + /// available for re-use. This doesn't contain the incremental part. + fn initialized_size(&self, lsn: Lsn) -> Option { + match self.initial_part_end { + Some(v) if v == lsn => self.initial_logical_size.get().copied(), + _ => None, + } } } @@ -806,11 +810,11 @@ impl Timeline { let mut is_exact = true; let size = current_size.size(); - if let (CurrentLogicalSize::Approximate(_), Some(init_lsn)) = + if let (CurrentLogicalSize::Approximate(_), Some(initial_part_end)) = (current_size, self.current_logical_size.initial_part_end) { is_exact = false; - self.try_spawn_size_init_task(init_lsn, ctx); + self.try_spawn_size_init_task(initial_part_end, ctx); } Ok((size, is_exact)) @@ -1688,7 +1692,7 @@ impl Timeline { Ok(()) } - fn try_spawn_size_init_task(self: &Arc, init_lsn: Lsn, ctx: &RequestContext) { + fn try_spawn_size_init_task(self: &Arc, lsn: Lsn, ctx: &RequestContext) { let permit = match Arc::clone(&self.current_logical_size.initial_size_computation) .try_acquire_owned() { @@ -1726,7 +1730,7 @@ impl Timeline { // NB: don't log errors here, task_mgr will do that. async move { let calculated_size = match self_clone - .logical_size_calculation_task(init_lsn, &background_ctx) + .logical_size_calculation_task(lsn, &background_ctx) .await { Ok(s) => s, @@ -1811,7 +1815,7 @@ impl Timeline { #[instrument(skip_all, fields(tenant = %self.tenant_id, timeline = %self.timeline_id))] async fn logical_size_calculation_task( self: &Arc, - init_lsn: Lsn, + lsn: Lsn, ctx: &RequestContext, ) -> Result { let mut timeline_state_updates = self.subscribe_for_state_updates(); @@ -1822,7 +1826,7 @@ impl Timeline { let cancel = cancel.child_token(); let ctx = ctx.attached_child(); self_calculation - .calculate_logical_size(init_lsn, cancel, &ctx) + .calculate_logical_size(lsn, cancel, &ctx) .await }; let timeline_state_cancellation = async { @@ -1906,21 +1910,12 @@ impl Timeline { // need to return something Ok(0) }); - let timer = if up_to_lsn == self.initdb_lsn { - if let Some(size) = self.current_logical_size.initialized_size() { - if size != 0 { - // non-zero size means that the size has already been calculated by this method - // after startup. if the logical size is for a new timeline without layers the - // size will be zero, and we cannot use that, or this caching strategy until - // pageserver restart. - return Ok(size); - } - } - - self.metrics.init_logical_size_histo.start_timer() - } else { - self.metrics.logical_size_histo.start_timer() - }; + // See if we've already done the work for initial size calculation. + // This is a short-cut for timelines that are mostly unused. + if let Some(size) = self.current_logical_size.initialized_size(up_to_lsn) { + return Ok(size); + } + let timer = self.metrics.logical_size_histo.start_timer(); let logical_size = self .get_current_logical_size_non_incremental(up_to_lsn, cancel, ctx) .await?; From fe156245708525d87bd3682595a3383e389efc65 Mon Sep 17 00:00:00 2001 From: Christian Schwarz Date: Mon, 27 Mar 2023 13:33:40 +0200 Subject: [PATCH 16/23] eviction_task: only refresh layer accesses once per p.threshold (#3877) Without this, we run it every p.period, which can be quite low. For example, the running experiment with 3000 tenants in prod uses a period of 1 minute. Doing it once per p.threshold is enough to prevent eviction. --- pageserver/src/tenant/timeline.rs | 8 ++++++++ pageserver/src/tenant/timeline/eviction_task.rs | 15 ++++++++++++++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs index 5fde1a77e0..dfa0e842f1 100644 --- a/pageserver/src/tenant/timeline.rs +++ b/pageserver/src/tenant/timeline.rs @@ -71,6 +71,8 @@ use crate::ZERO_PAGE; use crate::{is_temporary, task_mgr}; use walreceiver::spawn_connection_manager_task; +use self::eviction_task::EvictionTaskTimelineState; + use super::layer_map::BatchedUpdates; use super::remote_timeline_client::index::IndexPart; use super::remote_timeline_client::RemoteTimelineClient; @@ -216,6 +218,8 @@ pub struct Timeline { download_all_remote_layers_task_info: RwLock>, state: watch::Sender, + + eviction_task_timeline_state: tokio::sync::Mutex, } /// Internal structure to hold all data needed for logical size calculation. @@ -1252,6 +1256,10 @@ impl Timeline { download_all_remote_layers_task_info: RwLock::new(None), state, + + eviction_task_timeline_state: tokio::sync::Mutex::new( + EvictionTaskTimelineState::default(), + ), }; result.repartition_threshold = result.get_checkpoint_distance() / 10; result diff --git a/pageserver/src/tenant/timeline/eviction_task.rs b/pageserver/src/tenant/timeline/eviction_task.rs index 666768ff87..06dfe7a0b9 100644 --- a/pageserver/src/tenant/timeline/eviction_task.rs +++ b/pageserver/src/tenant/timeline/eviction_task.rs @@ -35,6 +35,11 @@ use crate::{ use super::Timeline; +#[derive(Default)] +pub struct EvictionTaskTimelineState { + last_refresh_required_in_restart: Option, +} + impl Timeline { pub(super) fn launch_eviction_task(self: &Arc) { let self_clone = Arc::clone(self); @@ -139,7 +144,15 @@ impl Timeline { // for active tenants this will likely materialized page cache or in-memory layers. for // inactive tenants it will refresh the last_access timestamps so that we will not evict // and re-download on restart these layers. - self.refresh_layers_required_in_restart(cancel, ctx).await; + let mut state = self.eviction_task_timeline_state.lock().await; + match state.last_refresh_required_in_restart { + Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ } + _ => { + self.refresh_layers_required_in_restart(cancel, ctx).await; + state.last_refresh_required_in_restart = Some(tokio::time::Instant::now()) + } + } + drop(state); if cancel.is_cancelled() { return ControlFlow::Break(()); From f14895b48ed6bea51b513f6650f89338d931974d Mon Sep 17 00:00:00 2001 From: Joonas Koivunen Date: Mon, 27 Mar 2023 20:20:23 +0300 Subject: [PATCH 17/23] eviction: avoid post-restart download by synthetic_size (#3871) As of #3867, we do artificial layer accesses to layers that will be needed after the next restart, but not until then because of caches. With this patch, we also do that for the accesses that the synthetic size calculation worker does if consumption metrics are enabled. The actual size calculation is not of importance, but we need to calculate all of the sizes, so we only call tenant::size::gather_inputs. Co-authored-by: Christian Schwarz --- pageserver/src/config.rs | 42 +++-- pageserver/src/tenant.rs | 4 + pageserver/src/tenant/size.rs | 9 +- pageserver/src/tenant/timeline.rs | 13 +- .../src/tenant/timeline/eviction_task.rs | 166 +++++++++++++++--- 5 files changed, 197 insertions(+), 37 deletions(-) diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 39282ce320..58a6056385 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -165,6 +165,10 @@ pub struct PageServerConf { /// Number of concurrent [`Tenant::gather_size_inputs`] allowed. pub concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore, + /// Limit of concurrent [`Tenant::gather_size_inputs`] issued by module `eviction_task`. + /// The number of permits is the same as `concurrent_tenant_size_logical_size_queries`. + /// See the comment in `eviction_task` for details. + pub eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore, // How often to collect metrics and send them to the metrics endpoint. pub metric_collection_interval: Duration, @@ -239,7 +243,7 @@ struct PageServerConfigBuilder { log_format: BuilderValue, - concurrent_tenant_size_logical_size_queries: BuilderValue, + concurrent_tenant_size_logical_size_queries: BuilderValue, metric_collection_interval: BuilderValue, cached_metric_collection_interval: BuilderValue, @@ -286,7 +290,9 @@ impl Default for PageServerConfigBuilder { .expect("cannot parse default keepalive interval")), log_format: Set(LogFormat::from_str(DEFAULT_LOG_FORMAT).unwrap()), - concurrent_tenant_size_logical_size_queries: Set(ConfigurableSemaphore::default()), + concurrent_tenant_size_logical_size_queries: Set( + ConfigurableSemaphore::DEFAULT_INITIAL, + ), metric_collection_interval: Set(humantime::parse_duration( DEFAULT_METRIC_COLLECTION_INTERVAL, ) @@ -389,7 +395,7 @@ impl PageServerConfigBuilder { self.log_format = BuilderValue::Set(log_format) } - pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: ConfigurableSemaphore) { + pub fn concurrent_tenant_size_logical_size_queries(&mut self, u: NonZeroUsize) { self.concurrent_tenant_size_logical_size_queries = BuilderValue::Set(u); } @@ -434,6 +440,11 @@ impl PageServerConfigBuilder { } pub fn build(self) -> anyhow::Result { + let concurrent_tenant_size_logical_size_queries = self + .concurrent_tenant_size_logical_size_queries + .ok_or(anyhow!( + "missing concurrent_tenant_size_logical_size_queries" + ))?; Ok(PageServerConf { listen_pg_addr: self .listen_pg_addr @@ -481,11 +492,12 @@ impl PageServerConfigBuilder { .broker_keepalive_interval .ok_or(anyhow!("No broker keepalive interval provided"))?, log_format: self.log_format.ok_or(anyhow!("missing log_format"))?, - concurrent_tenant_size_logical_size_queries: self - .concurrent_tenant_size_logical_size_queries - .ok_or(anyhow!( - "missing concurrent_tenant_size_logical_size_queries" - ))?, + concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new( + concurrent_tenant_size_logical_size_queries, + ), + eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::new( + concurrent_tenant_size_logical_size_queries, + ), metric_collection_interval: self .metric_collection_interval .ok_or(anyhow!("missing metric_collection_interval"))?, @@ -680,8 +692,7 @@ impl PageServerConf { "concurrent_tenant_size_logical_size_queries" => builder.concurrent_tenant_size_logical_size_queries({ let input = parse_toml_string(key, item)?; let permits = input.parse::().context("expected a number of initial permits, not {s:?}")?; - let permits = NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")?; - ConfigurableSemaphore::new(permits) + NonZeroUsize::new(permits).context("initial semaphore permits out of range: 0, use other configuration to disable a feature")? }), "metric_collection_interval" => builder.metric_collection_interval(parse_toml_duration(key, item)?), "cached_metric_collection_interval" => builder.cached_metric_collection_interval(parse_toml_duration(key, item)?), @@ -829,6 +840,8 @@ impl PageServerConf { broker_keepalive_interval: Duration::from_secs(5000), log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(), concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), + eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::default( + ), metric_collection_interval: Duration::from_secs(60), cached_metric_collection_interval: Duration::from_secs(60 * 60), metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT, @@ -921,6 +934,11 @@ impl ConfigurableSemaphore { inner: std::sync::Arc::new(tokio::sync::Semaphore::new(initial_permits.get())), } } + + /// Returns the configured amount of permits. + pub fn initial_permits(&self) -> NonZeroUsize { + self.initial_permits + } } impl Default for ConfigurableSemaphore { @@ -1025,6 +1043,8 @@ log_format = 'json' )?, log_format: LogFormat::from_str(defaults::DEFAULT_LOG_FORMAT).unwrap(), concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), + eviction_task_immitated_concurrent_logical_size_queries: + ConfigurableSemaphore::default(), metric_collection_interval: humantime::parse_duration( defaults::DEFAULT_METRIC_COLLECTION_INTERVAL )?, @@ -1085,6 +1105,8 @@ log_format = 'json' broker_keepalive_interval: Duration::from_secs(5), log_format: LogFormat::Json, concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(), + eviction_task_immitated_concurrent_logical_size_queries: + ConfigurableSemaphore::default(), metric_collection_interval: Duration::from_secs(222), cached_metric_collection_interval: Duration::from_secs(22200), metric_collection_endpoint: Some(Url::parse("http://localhost:80/metrics")?), diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs index 0a167fd787..2c5226e5bc 100644 --- a/pageserver/src/tenant.rs +++ b/pageserver/src/tenant.rs @@ -46,6 +46,7 @@ use std::time::{Duration, Instant}; use self::config::TenantConf; use self::metadata::TimelineMetadata; use self::remote_timeline_client::RemoteTimelineClient; +use self::timeline::EvictionTaskTenantState; use crate::config::PageServerConf; use crate::context::{DownloadBehavior, RequestContext}; use crate::import_datadir; @@ -142,6 +143,8 @@ pub struct Tenant { /// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`]. cached_logical_sizes: tokio::sync::Mutex>, cached_synthetic_tenant_size: Arc, + + eviction_task_tenant_state: tokio::sync::Mutex, } /// A timeline with some of its files on disk, being initialized. @@ -1781,6 +1784,7 @@ impl Tenant { state, cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()), cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)), + eviction_task_tenant_state: tokio::sync::Mutex::new(EvictionTaskTenantState::default()), } } diff --git a/pageserver/src/tenant/size.rs b/pageserver/src/tenant/size.rs index a41889f16d..77275f96bd 100644 --- a/pageserver/src/tenant/size.rs +++ b/pageserver/src/tenant/size.rs @@ -6,6 +6,7 @@ use std::sync::Arc; use anyhow::{bail, Context}; use tokio::sync::oneshot::error::RecvError; use tokio::sync::Semaphore; +use tokio_util::sync::CancellationToken; use crate::context::RequestContext; use crate::pgdatadir_mapping::CalculateLogicalSizeError; @@ -352,6 +353,10 @@ async fn fill_logical_sizes( // our advantage with `?` error handling. let mut joinset = tokio::task::JoinSet::new(); + let cancel = tokio_util::sync::CancellationToken::new(); + // be sure to cancel all spawned tasks if we are dropped + let _dg = cancel.clone().drop_guard(); + // For each point that would benefit from having a logical size available, // spawn a Task to fetch it, unless we have it cached already. for seg in segments.iter() { @@ -373,6 +378,7 @@ async fn fill_logical_sizes( timeline, lsn, ctx, + cancel.child_token(), )); } e.insert(cached_size); @@ -477,13 +483,14 @@ async fn calculate_logical_size( timeline: Arc, lsn: utils::lsn::Lsn, ctx: RequestContext, + cancel: CancellationToken, ) -> Result { let _permit = tokio::sync::Semaphore::acquire_owned(limit) .await .expect("global semaphore should not had been closed"); let size_res = timeline - .spawn_ondemand_logical_size_calculation(lsn, ctx) + .spawn_ondemand_logical_size_calculation(lsn, ctx, cancel) .instrument(info_span!("spawn_ondemand_logical_size_calculation")) .await?; Ok(TimelineAtLsnSizeResult(timeline, lsn, size_res)) diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs index dfa0e842f1..611c2c27d3 100644 --- a/pageserver/src/tenant/timeline.rs +++ b/pageserver/src/tenant/timeline.rs @@ -71,6 +71,7 @@ use crate::ZERO_PAGE; use crate::{is_temporary, task_mgr}; use walreceiver::spawn_connection_manager_task; +pub(super) use self::eviction_task::EvictionTaskTenantState; use self::eviction_task::EvictionTaskTimelineState; use super::layer_map::BatchedUpdates; @@ -1737,8 +1738,11 @@ impl Timeline { false, // NB: don't log errors here, task_mgr will do that. async move { + // no cancellation here, because nothing really waits for this to complete compared + // to spawn_ondemand_logical_size_calculation. + let cancel = CancellationToken::new(); let calculated_size = match self_clone - .logical_size_calculation_task(lsn, &background_ctx) + .logical_size_calculation_task(lsn, &background_ctx, cancel) .await { Ok(s) => s, @@ -1793,6 +1797,7 @@ impl Timeline { self: &Arc, lsn: Lsn, ctx: RequestContext, + cancel: CancellationToken, ) -> oneshot::Receiver> { let (sender, receiver) = oneshot::channel(); let self_clone = Arc::clone(self); @@ -1812,7 +1817,9 @@ impl Timeline { "ondemand logical size calculation", false, async move { - let res = self_clone.logical_size_calculation_task(lsn, &ctx).await; + let res = self_clone + .logical_size_calculation_task(lsn, &ctx, cancel) + .await; let _ = sender.send(res).ok(); Ok(()) // Receiver is responsible for handling errors }, @@ -1825,10 +1832,10 @@ impl Timeline { self: &Arc, lsn: Lsn, ctx: &RequestContext, + cancel: CancellationToken, ) -> Result { let mut timeline_state_updates = self.subscribe_for_state_updates(); let self_calculation = Arc::clone(self); - let cancel = CancellationToken::new(); let calculation = async { let cancel = cancel.child_token(); diff --git a/pageserver/src/tenant/timeline/eviction_task.rs b/pageserver/src/tenant/timeline/eviction_task.rs index 06dfe7a0b9..3ec8c30d70 100644 --- a/pageserver/src/tenant/timeline/eviction_task.rs +++ b/pageserver/src/tenant/timeline/eviction_task.rs @@ -14,6 +14,7 @@ //! //! See write-up on restart on-demand download spike: use std::{ + collections::HashMap, ops::ControlFlow, sync::Arc, time::{Duration, SystemTime}, @@ -30,6 +31,7 @@ use crate::{ tenant::{ config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold}, storage_layer::PersistentLayer, + Tenant, }, }; @@ -37,7 +39,12 @@ use super::Timeline; #[derive(Default)] pub struct EvictionTaskTimelineState { - last_refresh_required_in_restart: Option, + last_layer_access_imitation: Option, +} + +#[derive(Default)] +pub struct EvictionTaskTenantState { + last_layer_access_imitation: Option, } impl Timeline { @@ -127,6 +134,35 @@ impl Timeline { ) -> ControlFlow<()> { let now = SystemTime::now(); + // If we evict layers but keep cached values derived from those layers, then + // we face a storm of on-demand downloads after pageserver restart. + // The reason is that the restart empties the caches, and so, the values + // need to be re-computed by accessing layers, which we evicted while the + // caches were filled. + // + // Solutions here would be one of the following: + // 1. Have a persistent cache. + // 2. Count every access to a cached value to the access stats of all layers + // that were accessed to compute the value in the first place. + // 3. Invalidate the caches at a period of < p.threshold/2, so that the values + // get re-computed from layers, thereby counting towards layer access stats. + // 4. Make the eviction task imitate the layer accesses that typically hit caches. + // + // We follow approach (4) here because in Neon prod deployment: + // - page cache is quite small => high churn => low hit rate + // => eviction gets correct access stats + // - value-level caches such as logical size & repatition have a high hit rate, + // especially for inactive tenants + // => eviction sees zero accesses for these + // => they cause the on-demand download storm on pageserver restart + // + // We should probably move to persistent caches in the future, or avoid + // having inactive tenants attached to pageserver in the first place. + match self.imitate_layer_accesses(p, cancel, ctx).await { + ControlFlow::Break(()) => return ControlFlow::Break(()), + ControlFlow::Continue(()) => (), + } + #[allow(dead_code)] #[derive(Debug, Default)] struct EvictionStats { @@ -137,27 +173,6 @@ impl Timeline { skipped_for_shutdown: usize, } - // what we want is to invalidate any caches which haven't been accessed for `p.threshold`, - // but we cannot actually do it for current limitations except by restarting pageserver. we - // just recompute the values which would be recomputed on startup. - // - // for active tenants this will likely materialized page cache or in-memory layers. for - // inactive tenants it will refresh the last_access timestamps so that we will not evict - // and re-download on restart these layers. - let mut state = self.eviction_task_timeline_state.lock().await; - match state.last_refresh_required_in_restart { - Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ } - _ => { - self.refresh_layers_required_in_restart(cancel, ctx).await; - state.last_refresh_required_in_restart = Some(tokio::time::Instant::now()) - } - } - drop(state); - - if cancel.is_cancelled() { - return ControlFlow::Break(()); - } - let mut stats = EvictionStats::default(); // Gather layers for eviction. // NB: all the checks can be invalidated as soon as we release the layer map lock. @@ -261,8 +276,55 @@ impl Timeline { ControlFlow::Continue(()) } + async fn imitate_layer_accesses( + &self, + p: &EvictionPolicyLayerAccessThreshold, + cancel: &CancellationToken, + ctx: &RequestContext, + ) -> ControlFlow<()> { + let mut state = self.eviction_task_timeline_state.lock().await; + match state.last_layer_access_imitation { + Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ } + _ => { + self.imitate_timeline_cached_layer_accesses(cancel, ctx) + .await; + state.last_layer_access_imitation = Some(tokio::time::Instant::now()) + } + } + drop(state); + + if cancel.is_cancelled() { + return ControlFlow::Break(()); + } + + // This task is timeline-scoped, but the synthetic size calculation is tenant-scoped. + // Make one of the tenant's timelines draw the short straw and run the calculation. + // The others wait until the calculation is done so that they take into account the + // imitated accesses that the winner made. + let Ok(tenant) = crate::tenant::mgr::get_tenant(self.tenant_id, true).await else { + // likely, we're shutting down + return ControlFlow::Break(()); + }; + let mut state = tenant.eviction_task_tenant_state.lock().await; + match state.last_layer_access_imitation { + Some(ts) if ts.elapsed() < p.threshold => { /* no need to run */ } + _ => { + self.imitate_synthetic_size_calculation_worker(&tenant, ctx, cancel) + .await; + state.last_layer_access_imitation = Some(tokio::time::Instant::now()); + } + } + drop(state); + + if cancel.is_cancelled() { + return ControlFlow::Break(()); + } + + ControlFlow::Continue(()) + } + /// Recompute the values which would cause on-demand downloads during restart. - async fn refresh_layers_required_in_restart( + async fn imitate_timeline_cached_layer_accesses( &self, cancel: &CancellationToken, ctx: &RequestContext, @@ -296,4 +358,62 @@ impl Timeline { } } } + + // Imitate the synthetic size calculation done by the consumption_metrics module. + async fn imitate_synthetic_size_calculation_worker( + &self, + tenant: &Arc, + ctx: &RequestContext, + cancel: &CancellationToken, + ) { + if self.conf.metric_collection_endpoint.is_none() { + // We don't start the consumption metrics task if this is not set in the config. + // So, no need to imitate the accesses in that case. + return; + } + + // The consumption metrics are collected on a per-tenant basis, by a single + // global background loop. + // It limits the number of synthetic size calculations using the global + // `concurrent_tenant_size_logical_size_queries` semaphore to not overload + // the pageserver. (size calculation is somewhat expensive in terms of CPU and IOs). + // + // If we used that same semaphore here, then we'd compete for the + // same permits, which may impact timeliness of consumption metrics. + // That is a no-go, as consumption metrics are much more important + // than what we do here. + // + // So, we have a separate semaphore, initialized to the same + // number of permits as the `concurrent_tenant_size_logical_size_queries`. + // In the worst, we would have twice the amount of concurrenct size calculations. + // But in practice, the `p.threshold` >> `consumption metric interval`, and + // we spread out the eviction task using `random_init_delay`. + // So, the chance of the worst case is quite low in practice. + // It runs as a per-tenant task, but the eviction_task.rs is per-timeline. + // So, we must coordinate with other with other eviction tasks of this tenant. + let limit = self + .conf + .eviction_task_immitated_concurrent_logical_size_queries + .inner(); + + let mut throwaway_cache = HashMap::new(); + let gather = + crate::tenant::size::gather_inputs(tenant, limit, None, &mut throwaway_cache, ctx); + tokio::pin!(gather); + + tokio::select! { + _ = cancel.cancelled() => {} + gather_result = gather => { + match gather_result { + Ok(_) => {}, + Err(e) => { + // We don't care about the result, but, if it failed, we should log it, + // since consumption metric might be hitting the cached value and + // thus not encountering this error. + warn!("failed to imitate synthetic size calculation accesses: {e:#}") + } + } + } + } + } } From 6efea4344913bb4b7a135debf069efe1404aa2b9 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 27 Mar 2023 21:01:46 +0300 Subject: [PATCH 18/23] Use precondition failed code in delete_timeline when tenant is missing (#3884) This allows client to differentiate between missing tenant and missing timeline cases --- libs/utils/src/http/error.rs | 7 +++++++ pageserver/src/http/openapi_spec.yml | 14 ++++++++++++++ pageserver/src/http/routes.rs | 5 +++++ test_runner/regress/test_timeline_delete.py | 6 +++--- 4 files changed, 29 insertions(+), 3 deletions(-) diff --git a/libs/utils/src/http/error.rs b/libs/utils/src/http/error.rs index 1ba0422993..3c6023eb80 100644 --- a/libs/utils/src/http/error.rs +++ b/libs/utils/src/http/error.rs @@ -20,6 +20,9 @@ pub enum ApiError { #[error("Conflict: {0}")] Conflict(String), + #[error("Precondition failed: {0}")] + PreconditionFailed(&'static str), + #[error(transparent)] InternalServerError(anyhow::Error), } @@ -44,6 +47,10 @@ impl ApiError { ApiError::Conflict(_) => { HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::CONFLICT) } + ApiError::PreconditionFailed(_) => HttpErrorBody::response_from_msg_and_status( + self.to_string(), + StatusCode::PRECONDITION_FAILED, + ), ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status( err.to_string(), StatusCode::INTERNAL_SERVER_ERROR, diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index b8c3bffcd5..795c0cd3c4 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -189,6 +189,13 @@ paths: application/json: schema: $ref: "#/components/schemas/NotFoundError" + "412": + description: Tenant is missing + content: + application/json: + schema: + $ref: "#/components/schemas/PreconditionFailedError" + "500": description: Generic operation error content: @@ -958,6 +965,13 @@ components: properties: msg: type: string + PreconditionFailedError: + type: object + required: + - msg + properties: + msg: + type: string security: - JWT: [] diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index ba53729ea9..b0addc82f1 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -148,6 +148,11 @@ impl From for ApiError { fn from(value: crate::tenant::mgr::DeleteTimelineError) -> Self { use crate::tenant::mgr::DeleteTimelineError::*; match value { + // Report Precondition failed so client can distinguish between + // "tenant is missing" case from "timeline is missing" + Tenant(TenantStateError::NotFound(..)) => { + ApiError::PreconditionFailed("Requested tenant is missing") + } Tenant(t) => ApiError::from(t), Timeline(t) => ApiError::from(t), } diff --git a/test_runner/regress/test_timeline_delete.py b/test_runner/regress/test_timeline_delete.py index 30d894e04c..93fafff934 100644 --- a/test_runner/regress/test_timeline_delete.py +++ b/test_runner/regress/test_timeline_delete.py @@ -10,7 +10,7 @@ def test_timeline_delete(neon_simple_env: NeonEnv): env.pageserver.allowed_errors.append(".*Timeline .* was not found.*") env.pageserver.allowed_errors.append(".*timeline not found.*") env.pageserver.allowed_errors.append(".*Cannot delete timeline which has child timelines.*") - env.pageserver.allowed_errors.append(".*NotFound: tenant .*") + env.pageserver.allowed_errors.append(".*Precondition failed: Requested tenant is missing.*") ps_http = env.pageserver.http_client() @@ -24,11 +24,11 @@ def test_timeline_delete(neon_simple_env: NeonEnv): invalid_tenant_id = TenantId.generate() with pytest.raises( PageserverApiException, - match=f"NotFound: tenant {invalid_tenant_id}", + match="Precondition failed: Requested tenant is missing", ) as exc: ps_http.timeline_delete(tenant_id=invalid_tenant_id, timeline_id=invalid_timeline_id) - assert exc.value.status_code == 404 + assert exc.value.status_code == 412 # construct pair of branches to validate that pageserver prohibits # deletion of ancestor timelines when they have child branches From 82a47770467553a2950cc46dea61d54464e29bca Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Tue, 28 Mar 2023 08:27:50 +0300 Subject: [PATCH 19/23] Add local free space monitor (#3832) ## Describe your changes Monitor free spae in local file system and shrink local file cache size if it is under watermark. Neon is using local storage for temp files (temp table + intermediate results), unlogged relations and local file cache. Ideally all space not used for temporary files should be used for local file cache. Temporary files and even unlogged relation are intended to have small life time (because them can be lost at any moment in case of compute restart). So the policy is to overcommit local cache size and shrink it if there is not enough free space. As far as temporary files are expected to be needed for a short time, there i no need to permanently shrink local file cache size. Instead of it, we just throw away least recently accessed elements from local file cache, releasing some space on the local disk. ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --------- Co-authored-by: sharnoff --- pgxn/neon/file_cache.c | 110 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 107 insertions(+), 3 deletions(-) diff --git a/pgxn/neon/file_cache.c b/pgxn/neon/file_cache.c index 3a2ac380f9..143ad4bf67 100644 --- a/pgxn/neon/file_cache.c +++ b/pgxn/neon/file_cache.c @@ -14,6 +14,7 @@ */ #include +#include #include #include @@ -34,6 +35,9 @@ #include "storage/fd.h" #include "storage/pg_shmem.h" #include "storage/buf_internals.h" +#include "storage/procsignal.h" +#include "postmaster/bgworker.h" +#include "postmaster/interrupt.h" /* * Local file cache is used to temporary store relations pages in local file system. @@ -59,6 +63,9 @@ #define SIZE_MB_TO_CHUNKS(size) ((uint32)((size) * MB / BLCKSZ / BLOCKS_PER_CHUNK)) +#define MAX_MONITOR_INTERVAL_USEC 1000000 /* 1 second */ +#define MAX_DISK_WRITE_RATE 1000 /* MB/sec */ + typedef struct FileCacheEntry { BufferTag key; @@ -71,6 +78,7 @@ typedef struct FileCacheEntry typedef struct FileCacheControl { uint32 size; /* size of cache file in chunks */ + uint32 used; /* number of used chunks */ dlist_head lru; /* double linked list for LRU replacement algorithm */ } FileCacheControl; @@ -79,12 +87,14 @@ static int lfc_desc; static LWLockId lfc_lock; static int lfc_max_size; static int lfc_size_limit; +static int lfc_free_space_watermark; static char* lfc_path; static FileCacheControl* lfc_ctl; static shmem_startup_hook_type prev_shmem_startup_hook; #if PG_VERSION_NUM>=150000 static shmem_request_hook_type prev_shmem_request_hook; #endif +static int lfc_shrinking_factor; /* power of two by which local cache size will be shrinked when lfc_free_space_watermark is reached */ static void lfc_shmem_startup(void) @@ -112,6 +122,7 @@ lfc_shmem_startup(void) &info, HASH_ELEM | HASH_BLOBS); lfc_ctl->size = 0; + lfc_ctl->used = 0; dlist_init(&lfc_ctl->lru); /* Remove file cache on restart */ @@ -165,7 +176,7 @@ lfc_change_limit_hook(int newval, void *extra) } } LWLockAcquire(lfc_lock, LW_EXCLUSIVE); - while (new_size < lfc_ctl->size && !dlist_is_empty(&lfc_ctl->lru)) + while (new_size < lfc_ctl->used && !dlist_is_empty(&lfc_ctl->lru)) { /* Shrink cache by throwing away least recently accessed chunks and returning their space to file system */ FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru)); @@ -175,12 +186,86 @@ lfc_change_limit_hook(int newval, void *extra) elog(LOG, "Failed to punch hole in file: %m"); #endif hash_search(lfc_hash, &victim->key, HASH_REMOVE, NULL); - lfc_ctl->size -= 1; + lfc_ctl->used -= 1; } elog(LOG, "set local file cache limit to %d", new_size); LWLockRelease(lfc_lock); } +/* + * Local file system state monitor check available free space. + * If it is lower than lfc_free_space_watermark then we shrink size of local cache + * but throwing away least recently accessed chunks. + * First time low space watermark is reached cache size is divided by two, + * second time by four,... Finally we remove all chunks from local cache. + * + * Please notice that we are not changing lfc_cache_size: it is used to be adjusted by autoscaler. + * We only throw away cached chunks but do not prevent from filling cache by new chunks. + * + * Interval of poooling cache state is calculated as minimal time needed to consume lfc_free_space_watermark + * disk space with maximal possible disk write speed (1Gb/sec). But not larger than 1 second. + * Callinng statfs each second should not add some noticable overhead. + */ +void +FileCacheMonitorMain(Datum main_arg) +{ + /* + * Choose file system state monitor interval so that space can not be exosted + * during this period but not longer than MAX_MONITOR_INTERVAL (10 sec) + */ + uint64 monitor_interval = Min(MAX_MONITOR_INTERVAL_USEC, lfc_free_space_watermark*MB/MAX_DISK_WRITE_RATE); + + /* Establish signal handlers. */ + pqsignal(SIGUSR1, procsignal_sigusr1_handler); + pqsignal(SIGHUP, SignalHandlerForConfigReload); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + BackgroundWorkerUnblockSignals(); + + /* Periodically dump buffers until terminated. */ + while (!ShutdownRequestPending) + { + if (lfc_size_limit != 0) + { + struct statfs sfs; + if (statfs(lfc_path, &sfs) < 0) + { + elog(WARNING, "Failed to obtain status of %s: %m", lfc_path); + } + else + { + if (sfs.f_bavail*sfs.f_bsize < lfc_free_space_watermark*MB) + { + if (lfc_shrinking_factor < 31) { + lfc_shrinking_factor += 1; + } + lfc_change_limit_hook(lfc_size_limit >> lfc_shrinking_factor, NULL); + } + else + lfc_shrinking_factor = 0; /* reset to initial value */ + } + } + pg_usleep(monitor_interval); + } +} + +static void +lfc_register_free_space_monitor(void) +{ + BackgroundWorker bgw; + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "neon"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FileCacheMonitorMain"); + snprintf(bgw.bgw_name, BGW_MAXLEN, "Local free space monitor"); + snprintf(bgw.bgw_type, BGW_MAXLEN, "Local free space monitor"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + void lfc_init(void) { @@ -217,6 +302,19 @@ lfc_init(void) lfc_change_limit_hook, NULL); + DefineCustomIntVariable("neon.free_space_watermark", + "Minimal free space in local file system after reaching which local file cache will be truncated", + NULL, + &lfc_free_space_watermark, + 1024, /* 1GB */ + 0, + INT_MAX, + PGC_SIGHUP, + GUC_UNIT_MB, + NULL, + NULL, + NULL); + DefineCustomStringVariable("neon.file_cache_path", "Path to local file cache (can be raw device)", NULL, @@ -231,6 +329,9 @@ lfc_init(void) if (lfc_max_size == 0) return; + if (lfc_free_space_watermark != 0) + lfc_register_free_space_monitor(); + prev_shmem_startup_hook = shmem_startup_hook; shmem_startup_hook = lfc_shmem_startup; #if PG_VERSION_NUM>=150000 @@ -380,7 +481,7 @@ lfc_write(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno, * there are should be very large number of concurrent IO operations and them are limited by max_connections, * we prefer not to complicate code and use second approach. */ - if (lfc_ctl->size >= SIZE_MB_TO_CHUNKS(lfc_size_limit) && !dlist_is_empty(&lfc_ctl->lru)) + if (lfc_ctl->used >= SIZE_MB_TO_CHUNKS(lfc_size_limit) && !dlist_is_empty(&lfc_ctl->lru)) { /* Cache overflow: evict least recently used chunk */ FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru)); @@ -390,7 +491,10 @@ lfc_write(RelFileNode rnode, ForkNumber forkNum, BlockNumber blkno, elog(LOG, "Swap file cache page"); } else + { + lfc_ctl->used += 1; entry->offset = lfc_ctl->size++; /* allocate new chunk at end of file */ + } entry->access_count = 1; memset(entry->bitmap, 0, sizeof entry->bitmap); } From c30b9e6eb125c270d045030e544863bc69522003 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Fri, 24 Mar 2023 12:56:21 +0400 Subject: [PATCH 20/23] Show full path to pg_ctl invokation when it fails. --- control_plane/src/compute.rs | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index 46f0ad8d4f..ee504bfaa6 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -403,7 +403,7 @@ impl PostgresNode { fn pg_ctl(&self, args: &[&str], auth_token: &Option) -> Result<()> { let pg_ctl_path = self.env.pg_bin_dir(self.pg_version)?.join("pg_ctl"); - let mut cmd = Command::new(pg_ctl_path); + let mut cmd = Command::new(&pg_ctl_path); cmd.args( [ &[ @@ -432,7 +432,9 @@ impl PostgresNode { cmd.env("NEON_AUTH_TOKEN", token); } - let pg_ctl = cmd.output().context("pg_ctl failed")?; + let pg_ctl = cmd + .output() + .context(format!("{} failed", pg_ctl_path.display()))?; if !pg_ctl.status.success() { anyhow::bail!( "pg_ctl failed, exit code: {}, stdout: {}, stderr: {}", From 278d0f117d3b73ee302dd8b09388f5988f7d5011 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Tue, 28 Mar 2023 13:46:47 +0400 Subject: [PATCH 21/23] Rename neon_local sk logs s/safekeeper 1.log/safekeeper-1.log. I don't like spaces in file names. --- control_plane/src/safekeeper.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 84d6320573..d358f73343 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -156,7 +156,7 @@ impl SafekeeperNode { } background_process::start_process( - &format!("safekeeper {id}"), + &format!("safekeeper-{id}"), &datadir, &self.env.safekeeper_bin(), &args, From 35ecb139dc16f3c92b2cf6ca0afe0fd28fd86816 Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Tue, 28 Mar 2023 09:14:35 +0300 Subject: [PATCH 22/23] Use stavfs instead inof statfs to fix MacOS build --- pgxn/neon/file_cache.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/pgxn/neon/file_cache.c b/pgxn/neon/file_cache.c index 143ad4bf67..8fe53b1e7d 100644 --- a/pgxn/neon/file_cache.c +++ b/pgxn/neon/file_cache.c @@ -14,7 +14,7 @@ */ #include -#include +#include #include #include @@ -204,7 +204,7 @@ lfc_change_limit_hook(int newval, void *extra) * * Interval of poooling cache state is calculated as minimal time needed to consume lfc_free_space_watermark * disk space with maximal possible disk write speed (1Gb/sec). But not larger than 1 second. - * Callinng statfs each second should not add some noticable overhead. + * Callinng statvfs each second should not add some noticable overhead. */ void FileCacheMonitorMain(Datum main_arg) @@ -226,8 +226,8 @@ FileCacheMonitorMain(Datum main_arg) { if (lfc_size_limit != 0) { - struct statfs sfs; - if (statfs(lfc_path, &sfs) < 0) + struct statvfs sfs; + if (statvfs(lfc_path, &sfs) < 0) { elog(WARNING, "Failed to obtain status of %s: %m", lfc_path); } From 9798737ec65961351b55de07b3852a46d43052bf Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Tue, 28 Mar 2023 10:24:31 +0300 Subject: [PATCH 23/23] Update pgxn/neon/file_cache.c Co-authored-by: Heikki Linnakangas --- pgxn/neon/file_cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pgxn/neon/file_cache.c b/pgxn/neon/file_cache.c index 8fe53b1e7d..8dff259f02 100644 --- a/pgxn/neon/file_cache.c +++ b/pgxn/neon/file_cache.c @@ -204,7 +204,7 @@ lfc_change_limit_hook(int newval, void *extra) * * Interval of poooling cache state is calculated as minimal time needed to consume lfc_free_space_watermark * disk space with maximal possible disk write speed (1Gb/sec). But not larger than 1 second. - * Callinng statvfs each second should not add some noticable overhead. + * Calling statvfs each second should not add any noticeable overhead. */ void FileCacheMonitorMain(Datum main_arg)