fix(pageserver): abort on duplicate layers, before doing damage (#7799)

fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most
of the issue description here for posterity)

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)

As of #5198 , we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521))
than handles duplicate layers
2. [tests in the test
suite](d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15))
that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What is does instead is to return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowawdays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * PS PageCache block cache: caches old bit battern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent over-rename the file during
`.finish()`, bail with an error if it happens (Problem 2)
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 5 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
This commit is contained in:
Christian Schwarz
2024-06-04 18:16:23 +02:00
committed by GitHub
parent fd22fc5b7d
commit 17116f2ea9
11 changed files with 252 additions and 132 deletions

View File

@@ -3,6 +3,9 @@ use std::{fs, io, path::Path};
use anyhow::Context;
mod rename_noreplace;
pub use rename_noreplace::rename_noreplace;
pub trait PathExt {
/// Returns an error if `self` is not a directory.
fn is_empty_dir(&self) -> io::Result<bool>;

View File

@@ -0,0 +1,109 @@
use nix::NixPath;
/// Rename a file without replacing an existing file.
///
/// This is a wrapper around platform-specific APIs.
pub fn rename_noreplace<P1: ?Sized + NixPath, P2: ?Sized + NixPath>(
src: &P1,
dst: &P2,
) -> nix::Result<()> {
{
#[cfg(target_os = "linux")]
{
nix::fcntl::renameat2(
None,
src,
None,
dst,
nix::fcntl::RenameFlags::RENAME_NOREPLACE,
)
}
#[cfg(target_os = "macos")]
{
let res = src.with_nix_path(|src| {
dst.with_nix_path(|dst|
// SAFETY: `src` and `dst` are valid C strings as per the NixPath trait and they outlive the call to renamex_np.
unsafe {
nix::libc::renamex_np(src.as_ptr(), dst.as_ptr(), nix::libc::RENAME_EXCL)
})
})??;
nix::errno::Errno::result(res).map(drop)
}
#[cfg(not(any(target_os = "linux", target_os = "macos")))]
{
std::compile_error!("OS does not support no-replace renames");
}
}
}
#[cfg(test)]
mod test {
use std::{fs, path::PathBuf};
use super::*;
fn testdir() -> camino_tempfile::Utf8TempDir {
match crate::env::var("NEON_UTILS_RENAME_NOREPLACE_TESTDIR") {
Some(path) => {
let path: camino::Utf8PathBuf = path;
camino_tempfile::tempdir_in(path).unwrap()
}
None => camino_tempfile::tempdir().unwrap(),
}
}
#[test]
fn test_absolute_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
let src = testdir.path().join("src");
let dst = testdir.path().join("dst");
fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();
let src = src.canonicalize().unwrap();
assert!(src.is_absolute());
let dst = dst.canonicalize().unwrap();
assert!(dst.is_absolute());
let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}
#[test]
fn test_relative_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
// this is fine because we run in nextest => process per test
std::env::set_current_dir(testdir.path()).unwrap();
let src = PathBuf::from("src");
let dst = PathBuf::from("dst");
fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();
let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}
#[test]
fn test_works_when_not_exists() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
let src = testdir.path().join("src");
let dst = testdir.path().join("dst");
fs::write(&src, b"content").unwrap();
rename_noreplace(src.as_std_path(), dst.as_std_path()).unwrap();
assert_eq!(
"content",
String::from_utf8(std::fs::read(&dst).unwrap()).unwrap()
);
}
}

View File

@@ -3865,6 +3865,9 @@ pub(crate) mod harness {
pub fn create_custom(
test_name: &'static str,
tenant_conf: TenantConf,
tenant_id: TenantId,
shard_identity: ShardIdentity,
generation: Generation,
) -> anyhow::Result<Self> {
setup_logging();
@@ -3877,8 +3880,12 @@ pub(crate) mod harness {
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenant_id = TenantId::generate();
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let shard = shard_identity.shard_index();
let tenant_shard_id = TenantShardId {
tenant_id,
shard_number: shard.shard_number,
shard_count: shard.shard_count,
};
fs::create_dir_all(conf.tenant_path(&tenant_shard_id))?;
fs::create_dir_all(conf.timelines_path(&tenant_shard_id))?;
@@ -3896,8 +3903,8 @@ pub(crate) mod harness {
conf,
tenant_conf,
tenant_shard_id,
generation: Generation::new(0xdeadbeef),
shard: ShardIndex::unsharded(),
generation,
shard,
remote_storage,
remote_fs_dir,
deletion_queue,
@@ -3912,8 +3919,15 @@ pub(crate) mod harness {
compaction_period: Duration::ZERO,
..TenantConf::default()
};
Self::create_custom(test_name, tenant_conf)
let tenant_id = TenantId::generate();
let shard = ShardIdentity::unsharded();
Self::create_custom(
test_name,
tenant_conf,
tenant_id,
shard,
Generation::new(0xdeadbeef),
)
}
pub fn span(&self) -> tracing::Span {
@@ -4037,6 +4051,7 @@ mod tests {
use tests::storage_layer::ValuesReconstructState;
use tests::timeline::{GetVectoredError, ShutdownMode};
use utils::bin_ser::BeSer;
use utils::id::TenantId;
static TEST_KEY: Lazy<Key> =
Lazy::new(|| Key::from_slice(&hex!("010000000033333333444444445500000001")));
@@ -4936,7 +4951,13 @@ mod tests {
..TenantConf::default()
};
let harness = TenantHarness::create_custom("test_get_vectored_key_gap", tenant_conf)?;
let harness = TenantHarness::create_custom(
"test_get_vectored_key_gap",
tenant_conf,
TenantId::generate(),
ShardIdentity::unsharded(),
Generation::new(0xdeadbeef),
)?;
let (tenant, ctx) = harness.load().await;
let mut current_key = Key::from_hex("010000000033333333444444445500000000").unwrap();

View File

@@ -478,6 +478,23 @@ impl DeltaLayerWriterInner {
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let temp_path = self.path.clone();
let result = self.finish0(key_end, timeline, ctx).await;
if result.is_err() {
tracing::info!(%temp_path, "cleaning up temporary file after error during writing");
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!(error=%e, %temp_path, "error cleaning up temporary layer file after error during writing");
}
}
result
}
async fn finish0(
self,
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -651,19 +668,11 @@ impl DeltaLayerWriter {
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let inner = self.inner.take().unwrap();
let temp_path = inner.path.clone();
let result = inner.finish(key_end, timeline, ctx).await;
// The delta layer files can sometimes be really large. Clean them up.
if result.is_err() {
tracing::warn!(
"Cleaning up temporary delta file {temp_path} after error during writing"
);
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!("Error cleaning up temporary delta layer file {temp_path}: {e:?}")
}
}
result
self.inner
.take()
.unwrap()
.finish(key_end, timeline, ctx)
.await
}
}

View File

@@ -917,26 +917,57 @@ impl Drop for ImageLayerWriter {
#[cfg(test)]
mod test {
use std::time::Duration;
use bytes::Bytes;
use pageserver_api::{
key::Key,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
};
use utils::{id::TimelineId, lsn::Lsn};
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::{tenant::harness::TenantHarness, DEFAULT_PG_VERSION};
use crate::{
tenant::{config::TenantConf, harness::TenantHarness},
DEFAULT_PG_VERSION,
};
use super::ImageLayerWriter;
#[tokio::test]
async fn image_layer_rewrite() {
let harness = TenantHarness::create("test_image_layer_rewrite").unwrap();
let (tenant, ctx) = harness.load().await;
let tenant_conf = TenantConf {
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
..TenantConf::default()
};
let tenant_id = TenantId::generate();
let mut gen = Generation::new(0xdead0001);
let mut get_next_gen = || {
let ret = gen;
gen = gen.next();
ret
};
// The LSN at which we will create an image layer to filter
let lsn = Lsn(0xdeadbeef0000);
let timeline_id = TimelineId::generate();
//
// Create an unsharded parent with a layer.
//
let harness = TenantHarness::create_custom(
"test_image_layer_rewrite--parent",
tenant_conf.clone(),
tenant_id,
ShardIdentity::unsharded(),
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
@@ -971,9 +1002,47 @@ mod test {
};
let original_size = resident.metadata().file_size;
//
// Create child shards and do the rewrite, exercising filter().
// TODO: abstraction in TenantHarness for splits.
//
// Filter for various shards: this exercises cases like values at start of key range, end of key
// range, middle of key range.
for shard_number in 0..4 {
let shard_count = ShardCount::new(4);
for shard_number in 0..shard_count.count() {
//
// mimic the shard split
//
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
shard_count,
ShardStripeSize(0x8000),
)
.unwrap();
let harness = TenantHarness::create_custom(
Box::leak(Box::new(format!(
"test_image_layer_rewrite--child{}",
shard_identity.shard_slug()
))),
tenant_conf.clone(),
tenant_id,
shard_identity,
// NB: in reality, the shards would each fork off their own gen number sequence from the parent.
// But here, all we care about is that the gen number is unique.
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();
//
// use filter() and make assertions
//
let mut filtered_writer = ImageLayerWriter::new(
harness.conf,
timeline_id,
@@ -985,15 +1054,6 @@ mod test {
.await
.unwrap();
// TenantHarness gave us an unsharded tenant, but we'll use a sharded ShardIdentity
// to exercise filter()
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
ShardCount::new(4),
ShardStripeSize(0x8000),
)
.unwrap();
let wrote_keys = resident
.filter(&shard_identity, &mut filtered_writer, &ctx)
.await

View File

@@ -277,9 +277,10 @@ impl Layer {
let downloaded = resident.expect("just initialized");
// if the rename works, the path is as expected
// TODO: sync system call
std::fs::rename(temp_path, owner.local_path())
// We never want to overwrite an existing file, so we use `RENAME_NOREPLACE`.
// TODO: this leaves the temp file in place if the rename fails, risking us running
// out of space. Should we clean it up here or does the calling context deal with this?
utils::fs_ext::rename_noreplace(temp_path.as_std_path(), owner.local_path().as_std_path())
.with_context(|| format!("rename temporary file as correct path for {owner}"))?;
Ok(ResidentLayer { downloaded, owner })

View File

@@ -421,48 +421,6 @@ impl Timeline {
return Ok(CompactLevel0Phase1Result::default());
}
// This failpoint is used together with `test_duplicate_layers` integration test.
// It returns the compaction result exactly the same layers as input to compaction.
// We want to ensure that this will not cause any problem when updating the layer map
// after the compaction is finished.
//
// Currently, there are two rare edge cases that will cause duplicated layers being
// inserted.
// 1. The compaction job is inturrupted / did not finish successfully. Assume we have file 1, 2, 3, 4, which
// is compacted to 5, but the page server is shut down, next time we start page server we will get a layer
// map containing 1, 2, 3, 4, and 5, whereas 5 has the same content as 4. If we trigger L0 compation at this
// point again, it is likely that we will get a file 6 which has the same content and the key range as 5,
// and this causes an overwrite. This is acceptable because the content is the same, and we should do a
// layer replace instead of the normal remove / upload process.
// 2. The input workload pattern creates exactly n files that are sorted, non-overlapping and is of target file
// size length. Compaction will likely create the same set of n files afterwards.
//
// This failpoint is a superset of both of the cases.
if cfg!(feature = "testing") {
let active = (|| {
::fail::fail_point!("compact-level0-phase1-return-same", |_| true);
false
})();
if active {
let mut new_layers = Vec::with_capacity(level0_deltas.len());
for delta in &level0_deltas {
// we are just faking these layers as being produced again for this failpoint
new_layers.push(
delta
.download_and_keep_resident()
.await
.context("download layer for failpoint")?,
);
}
tracing::info!("compact-level0-phase1-return-same"); // so that we can check if we hit the failpoint
return Ok(CompactLevel0Phase1Result {
new_layers,
deltas_to_compact: level0_deltas,
});
}
}
// Gather the files to compact in this iteration.
//
// Start with the oldest Level 0 delta file, and collect any other

View File

@@ -238,7 +238,7 @@ def test_uploads_and_deletions(
# https://github.com/neondatabase/neon/issues/7707
# https://github.com/neondatabase/neon/issues/7759
allowed_errors = [
".*duplicated L1 layer.*",
".*/checkpoint.*rename temporary file as correct path for.*", # EEXIST
".*delta layer created with.*duplicate values.*",
".*assertion failed: self.lsn_range.start <= lsn.*",
".*HTTP request handler task panicked: task.*panicked.*",

View File

@@ -12,42 +12,14 @@ from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from requests.exceptions import ConnectionError
def test_duplicate_layers(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
env = neon_env_builder.init_start()
pageserver_http = env.pageserver.http_client()
# use a failpoint to return all L0s as L1s
message = ".*duplicated L1 layer layer=.*"
env.pageserver.allowed_errors.append(message)
# Use aggressive compaction and checkpoint settings
tenant_id, _ = env.neon_cli.create_tenant(
conf={
"checkpoint_distance": f"{1024 ** 2}",
"compaction_target_size": f"{1024 ** 2}",
"compaction_period": "5 s",
"compaction_threshold": "3",
}
)
pageserver_http.configure_failpoints(("compact-level0-phase1-return-same", "return"))
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
connstr = endpoint.connstr(options="-csynchronous_commit=off")
pg_bin.run_capture(["pgbench", "-i", "-s1", connstr])
time.sleep(10) # let compaction to be performed
env.pageserver.assert_log_contains("compact-level0-phase1-return-same")
def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
def test_local_only_layers_after_crash(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
"""
Test sets fail point at the end of first compaction phase: after
flushing new L1 layer but before deletion of L0 layers.
Test case for docs/rfcs/027-crash-consistent-layer-map-through-index-part.md.
The L1 used to be overwritten, but with crash-consistency via remote
index_part.json, we end up deleting the not yet uploaded L1 layer on
startup.
Simulate crash after compaction has written layers to disk
but before they have been uploaded/linked into remote index_part.json.
Startup handles this situation by deleting the not yet uploaded L1 layer files.
"""
neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.LOCAL_FS)
@@ -126,13 +98,6 @@ def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin)
# give time for log flush
time.sleep(1)
message = f".*duplicated L1 layer layer={l1_found}"
found_msg = env.pageserver.log_contains(message)
# resident or evicted, it should not be overwritten, however it should had been non-existing at startup
assert (
found_msg is None
), "layer should had been removed during startup, did it live on as evicted?"
assert env.pageserver.layer_exists(tenant_id, timeline_id, l1_found), "the L1 reappears"
wait_for_upload_queue_empty(pageserver_http, tenant_id, timeline_id)
@@ -141,3 +106,6 @@ def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin)
tenant_id, timeline_id, l1_found.to_str()
)
assert uploaded.exists(), "the L1 is uploaded"
# TODO: same test for L0s produced by ingest.

View File

@@ -163,11 +163,6 @@ def test_pageserver_chaos(
env = neon_env_builder.init_start(initial_tenant_shard_count=shard_count)
# these can happen, if we shutdown at a good time. to be fixed as part of #5172.
message = ".*duplicated L1 layer layer=.*"
for ps in env.pageservers:
ps.allowed_errors.append(message)
# Use a tiny checkpoint distance, to create a lot of layers quickly.
# That allows us to stress the compaction and layer flushing logic more.
tenant, _ = env.neon_cli.create_tenant(

View File

@@ -100,10 +100,6 @@ def test_location_conf_churn(neon_env_builder: NeonEnvBuilder, seed: int):
]
)
# these can happen, if we shutdown at a good time. to be fixed as part of #5172.
message = ".*duplicated L1 layer layer=.*"
ps.allowed_errors.append(message)
workload = Workload(env, tenant_id, timeline_id)
workload.init(env.pageservers[0].id)
workload.write_rows(256, env.pageservers[0].id)