mirror of
https://github.com/neondatabase/neon.git
synced 2026-01-16 18:02:56 +00:00
In safekeepers we have several background tasks. Previously, the WAL backup task was spawned by another task called `wal_backup_launcher`. That task received notifications via the `wal_backup_launcher_rx` channel and decided whether to spawn or kill the backup task associated with a timeline. This was inconvenient because every code segment that touched shared state was responsible for pushing a notification into the `wal_backup_launcher_tx` channel. It was error prone: the notification was easy to miss, and pushing it in the wrong order could in some cases lead to a deadlock.

We had a similar issue with the `is_active` timeline flag. That flag was calculated from the state, and code modifying the state had to call a function to update the flag. We had a few bugs caused by forgetting to update `is_active` in places where it could change.

To fix these issues, this PR adds a new `timeline_manager` background task associated with each timeline. This task is responsible for managing all background tasks, including the `is_active` flag, which is used for pushing broker messages. In a loop, it subscribes to updates of the timeline state and decides when to spawn or kill background tasks.

There is a new structure called `TimelinesSet`. It stores a set of `Arc<Timeline>` and allows copying the set out, so iteration does not hold the mutex. This is what replaces the `is_active` flag for the broker: the broker push task now holds a reference to a `TimelinesSet` of active timelines and uses it instead of iterating over all timelines and filtering by the `is_active` flag.

Also added some metrics for manager iterations and active backup tasks. Ideally the manager should not need many iterations, and we should not have many backup tasks spawned at the same time.

Fixes #7751

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
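The core idea behind `TimelinesSet` can be sketched as follows. This is a minimal, hypothetical reconstruction, not the actual Neon implementation: the `Timeline` struct here is a stand-in with just an id, and the method names (`insert`, `remove`, `snapshot`) are illustrative. The point it demonstrates is the locking pattern: mutating the set takes the mutex briefly, while readers clone the contents out and iterate with the lock released.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the safekeeper Timeline; only an id is needed here.
#[derive(Debug)]
struct Timeline {
    id: u64,
}

/// Sketch of the TimelinesSet idea: a mutex-guarded set of Arc<Timeline>
/// whose contents are cloned out for iteration, so consumers (e.g. the
/// broker push loop) never iterate while holding the lock.
struct TimelinesSet {
    inner: Mutex<HashMap<u64, Arc<Timeline>>>,
}

impl TimelinesSet {
    fn new() -> Self {
        Self {
            inner: Mutex::new(HashMap::new()),
        }
    }

    /// Called by the timeline manager when a timeline becomes active.
    fn insert(&self, tli: Arc<Timeline>) {
        self.inner.lock().unwrap().insert(tli.id, tli);
    }

    /// Called by the timeline manager when a timeline becomes inactive.
    fn remove(&self, id: u64) {
        self.inner.lock().unwrap().remove(&id);
    }

    /// Copy the current set out; the lock is released before the caller
    /// iterates, so slow per-timeline work cannot block writers.
    fn snapshot(&self) -> Vec<Arc<Timeline>> {
        self.inner.lock().unwrap().values().cloned().collect()
    }
}

fn main() {
    let active = TimelinesSet::new();
    active.insert(Arc::new(Timeline { id: 1 }));
    active.insert(Arc::new(Timeline { id: 2 }));
    active.remove(2);

    // Iteration happens over a cheap copy, with the mutex unlocked.
    for tli in active.snapshot() {
        println!("active timeline: {}", tli.id);
    }
}
```

Because the set holds `Arc<Timeline>`, the clone in `snapshot` only bumps reference counts; the consumer sees a consistent view of which timelines were active at the moment of the snapshot.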
42 lines
1.2 KiB
Rust
//! Thread removing old WAL.

use std::time::Duration;

use tokio::time::sleep;
use tracing::*;

use crate::{GlobalTimelines, SafeKeeperConf};

pub async fn task_main(_conf: SafeKeeperConf) -> anyhow::Result<()> {
    let wal_removal_interval = Duration::from_millis(5000);
    loop {
        let now = tokio::time::Instant::now();
        let tlis = GlobalTimelines::get_all();
        for tli in &tlis {
            let ttid = tli.ttid;
            async {
                if let Err(e) = tli.maybe_persist_control_file().await {
                    warn!("failed to persist control file: {e}");
                }
                if let Err(e) = tli.remove_old_wal().await {
                    error!("failed to remove WAL: {}", e);
                }
            }
            .instrument(info_span!("WAL removal", ttid = %ttid))
            .await;
        }

        let elapsed = now.elapsed();
        let total_timelines = tlis.len();

        if elapsed > wal_removal_interval {
            info!(
                "WAL removal is too long, processed {} timelines in {:?}",
                total_timelines, elapsed
            );
        }

        sleep(wal_removal_interval).await;
    }
}