mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-14 19:50:38 +00:00
# Refs

- fixes https://github.com/neondatabase/neon/issues/11762

# Problem

PR #10993 introduced internal retries for BufferedWriter flushes.
PR #11052 added cancellation sensitivity to that retry loop.
That cancellation sensitivity is an error path that didn't exist before.

The result is that during timeline shutdown, after we `Timeline::cancel`, compaction can now fail with error `flush task cancelled`.
The problem with that:

1. We mis-classify this as an `error!`-worthy event.
2. This causes tests to become flaky because the error is not in global `allowed_errors`.

Technically we also trip the `compaction_circuit_breaker` because the resulting `CompactionError` is variant `::Other`.
But since this is Timeline shutdown, it doesn't matter practically speaking.

# Solution / Changes

- Log the anyhow stack trace when classifying a compaction error as `error!`. This was helpful to identify sources of `flush task cancelled` errors. We only log at `error!` level in exceptional circumstances, so it's ok to have a bit verbose logs.
- Introduce typed errors along the `BufferedWriter::write_*` => `BlobWriter::write_blob` => `{Delta,Image}LayerWriter::put_*` => `Split{Delta,Image}LayerWriter::put_{value,image}` chain.
- Proper mapping to `CompactionError`/`CreateImageLayersError` via new `From` impls.

I am usually opposed to any magic `From` impls, but it's how most of the compaction code works today.

# Testing

The symptoms are most prevalent in `test_runner/regress/test_branch_and_gc.py::test_branch_and_gc`.
Before this PR, I was able to reproduce locally 1 or 2 times per 400 runs using `DEFAULT_PG_VERSION=15 BUILD_TYPE=release poetry run pytest --count 400 -n 8`.
After this PR, it doesn't reproduce anymore after 2000 runs.

# Future Work

Technically the ingest path is also exposed to this new source of errors because `InMemoryLayer` is backed by `BufferedWriter`.
But we haven't seen it occur in flaky tests yet.

Details and a fix in
- https://github.com/neondatabase/neon/pull/11851
25 lines
576 B
Rust
use crate::tenant::blob_io::WriteBlobError;

/// Typed error for the layer writers' `put_*` methods.
#[derive(Debug, thiserror::Error)]
pub enum PutError {
    /// The underlying blob write failed; this may be a cancellation.
    #[error(transparent)]
    WriteBlob(WriteBlobError),
    /// Any other failure.
    #[error(transparent)]
    Other(anyhow::Error),
}

impl PutError {
    /// Returns `true` if the error was caused by cancellation.
    pub fn is_cancel(&self) -> bool {
        match self {
            PutError::WriteBlob(e) => e.is_cancel(),
            PutError::Other(_) => false,
        }
    }
    /// Converts this error into an `anyhow::Error`.
    pub fn into_anyhow(self) -> anyhow::Error {
        match self {
            PutError::WriteBlob(e) => e.into_anyhow(),
            PutError::Other(e) => e,
        }
    }
}
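
The commit message describes mapping these typed errors to `CompactionError` through `From` impls so that cancellation is no longer treated as an `error!`-worthy event. The following is a minimal, std-only sketch of that idea and not the code from the PR. The `WriteBlobError` variants, the `CompactionError::ShuttingDown` variant, the `String` payloads (standing in for `anyhow::Error`), and the `is_error_worthy` helper are all assumptions made for illustration.

```rust
use std::fmt;

// Stand-in for `crate::tenant::blob_io::WriteBlobError`. Its real variants
// are not shown in this file, so `Cancelled` and `Other` are assumptions.
#[derive(Debug)]
pub enum WriteBlobError {
    Cancelled,
    Other(String),
}

impl WriteBlobError {
    pub fn is_cancel(&self) -> bool {
        matches!(self, WriteBlobError::Cancelled)
    }
}

impl fmt::Display for WriteBlobError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            WriteBlobError::Cancelled => write!(f, "flush task cancelled"),
            WriteBlobError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

// Simplified copy of `PutError`; `String` replaces `anyhow::Error` so the
// sketch compiles without external crates.
#[derive(Debug)]
pub enum PutError {
    WriteBlob(WriteBlobError),
    Other(String),
}

impl PutError {
    pub fn is_cancel(&self) -> bool {
        match self {
            PutError::WriteBlob(e) => e.is_cancel(),
            PutError::Other(_) => false,
        }
    }
}

impl fmt::Display for PutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PutError::WriteBlob(e) => write!(f, "{e}"),
            PutError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

// Hypothetical, simplified caller-side error. Only the `Other` variant is
// mentioned in the commit message; `ShuttingDown` is assumed here.
#[derive(Debug, PartialEq)]
pub enum CompactionError {
    ShuttingDown,
    Other(String),
}

// Cancellation maps to a dedicated variant; everything else stays `Other`.
impl From<PutError> for CompactionError {
    fn from(e: PutError) -> Self {
        if e.is_cancel() {
            CompactionError::ShuttingDown
        } else {
            CompactionError::Other(e.to_string())
        }
    }
}

// Hypothetical helper: only non-cancellation errors should be logged at
// `error!` level (and count toward a circuit breaker).
pub fn is_error_worthy(e: &CompactionError) -> bool {
    !matches!(e, CompactionError::ShuttingDown)
}
```

With a mapping like this, a caller can propagate a `PutError` with `?` and decide at the top of the compaction loop whether the resulting error should be logged, instead of string-matching on `flush task cancelled`.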