neon/safekeeper/src/timeline_guard.rs
John Spray 33dce25af8 safekeeper: block deletion on protocol handler shutdown (#9364)
## Problem

Two recently observed log errors indicate safekeeper tasks for a
timeline still running after that timeline's deletion has started.
- https://github.com/neondatabase/neon/issues/8972
- https://github.com/neondatabase/neon/issues/8974

These code paths do not have a mechanism that coordinates task shutdown
with the overall shutdown of the timeline.

## Summary of changes

- Add a `Gate` to `Timeline`
- Take the gate as part of the resident timeline guard: any code that
holds a guard keeping a timeline resident should also hold a guard on
the timeline's overall lifetime.
- Take the gate from the wal removal task
- Respect Timeline::cancel in WAL send/recv code, so that we do not
block shutdown indefinitely.
- Add a test that deletes timelines with open pageserver+compute
connections, to check these get torn down as expected.
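To illustrate the gate pattern these changes rely on, here is a minimal, synchronous sketch: tasks `enter()` the gate while touching timeline data, and shutdown blocks in `close()` until every guard has been dropped. This is a hypothetical stand-in built on the standard library; the real `Gate` lives in neon's `utils::sync::gate` and is async.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Simplified stand-in for utils::sync::gate::Gate: a counter of live
// guards plus a condvar that close() waits on.
struct Gate {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

struct GateGuard {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl Gate {
    fn new() -> Self {
        Gate {
            inner: Arc::new((Mutex::new(0), Condvar::new())),
        }
    }

    // Acquire a guard; holding it keeps the timeline "alive".
    fn enter(&self) -> GateGuard {
        *self.inner.0.lock().unwrap() += 1;
        GateGuard {
            inner: Arc::clone(&self.inner),
        }
    }

    // Block until all guards are dropped; called on timeline shutdown,
    // before timeline data may be deleted.
    fn close(&self) {
        let (lock, cvar) = &*self.inner;
        let mut count = lock.lock().unwrap();
        while *count > 0 {
            count = cvar.wait(count).unwrap();
        }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() -= 1;
        cvar.notify_all();
    }
}

fn main() {
    let gate = Gate::new();
    let guard = gate.enter();
    std::thread::spawn(move || {
        // Simulate a protocol handler finishing its work, then
        // releasing its hold on the timeline.
        std::thread::sleep(std::time::Duration::from_millis(50));
        drop(guard);
    });
    gate.close(); // returns only once the guard is dropped
    println!("all guards released, safe to delete timeline");
}
```

The key invariant is the same as in the real code: deletion cannot proceed while any task still holds a guard.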

There is some risk to introducing gates: if code holds a gate but does
not properly respect a cancellation token, it can hang shutdown. In
practice this risk is lower for safekeepers than for other services,
because in a healthy timeline deletion the compute is shut down first,
then the timeline is deleted on the pageserver, and only then on the
safekeepers -- which makes it much less likely that some protocol
handler will still be running.
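The "respect cancellation while holding a gate" rule can be sketched as follows. This uses a plain `AtomicBool` as a hypothetical stand-in for `Timeline::cancel` (the real code uses an async cancellation token); the point is that any loop run while holding a gate guard must poll for cancellation, or `Gate::close` will wait forever.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Stand-in for a WAL send/recv handler loop. The handler conceptually
// holds a gate guard for its whole lifetime, so it must notice
// cancellation and exit promptly for shutdown to make progress.
fn run_handler(cancel: Arc<AtomicBool>) {
    loop {
        if cancel.load(Ordering::Relaxed) {
            // Break out promptly so our (conceptual) gate guard is
            // released and timeline deletion can proceed.
            break;
        }
        // ... serve one unit of WAL send/recv work ...
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let c = Arc::clone(&cancel);
    let handle = thread::spawn(move || run_handler(c));

    thread::sleep(Duration::from_millis(30));
    cancel.store(true, Ordering::Relaxed); // timeline deletion starts
    handle.join().unwrap(); // handler exits; no shutdown hang
    println!("handler stopped on cancel");
}
```

A handler that skipped the cancellation check would be exactly the hang scenario described above: the gate holder never exits, so shutdown blocks indefinitely.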

Closes: #8972
Closes: #8974
2024-11-20 11:07:45 +00:00


//! Timeline residence guard
//!
//! A residence guard ensures that a timeline's WAL segments stay present on disk
//! for as long as the guard is held. This file implements the guard logic: issuing
//! and dropping guards, and notifying the manager when a guard is dropped.
use std::collections::HashSet;

use tracing::debug;
use utils::sync::gate::GateGuard;

use crate::timeline_manager::ManagerCtlMessage;

#[derive(Debug, Clone, Copy)]
pub struct GuardId(u64);

pub struct ResidenceGuard {
    manager_tx: tokio::sync::mpsc::UnboundedSender<ManagerCtlMessage>,
    guard_id: GuardId,

    /// [`ResidenceGuard`] represents a guarantee that a timeline's data remains resident,
    /// which by extension also means the timeline is not shut down (since after shutdown
    /// our data may be deleted). Therefore everyone holding a residence guard must also
    /// hold a guard on [`crate::timeline::Timeline::gate`].
    _gate_guard: GateGuard,
}

impl Drop for ResidenceGuard {
    fn drop(&mut self) {
        // Notify the manager that the guard is dropped.
        let res = self
            .manager_tx
            .send(ManagerCtlMessage::GuardDrop(self.guard_id));
        if let Err(e) = res {
            debug!("failed to send GuardDrop message: {:?}", e);
        }
    }
}

/// `AccessService` is responsible for issuing and dropping residence guards.
/// All guards are stored in the `guards` set.
/// TODO: it's possible to add a `String` name to each guard, for better observability.
pub(crate) struct AccessService {
    next_guard_id: u64,
    guards: HashSet<u64>,
    manager_tx: tokio::sync::mpsc::UnboundedSender<ManagerCtlMessage>,
}

impl AccessService {
    pub(crate) fn new(manager_tx: tokio::sync::mpsc::UnboundedSender<ManagerCtlMessage>) -> Self {
        Self {
            next_guard_id: 0,
            guards: HashSet::new(),
            manager_tx,
        }
    }

    pub(crate) fn is_empty(&self) -> bool {
        self.guards.is_empty()
    }

    /// `timeline_gate_guard` is a guarantee that the timeline is not shut down.
    pub(crate) fn create_guard(&mut self, timeline_gate_guard: GateGuard) -> ResidenceGuard {
        let guard_id = self.next_guard_id;
        self.next_guard_id += 1;
        self.guards.insert(guard_id);

        let guard_id = GuardId(guard_id);
        debug!("issued a new guard {:?}", guard_id);
        ResidenceGuard {
            manager_tx: self.manager_tx.clone(),
            guard_id,
            _gate_guard: timeline_gate_guard,
        }
    }

    pub(crate) fn drop_guard(&mut self, guard_id: GuardId) {
        debug!("dropping guard {:?}", guard_id);
        assert!(self.guards.remove(&guard_id.0));
    }
}